ABSTRACT


This  report  makes  an  initial  attempt  at  presenting  a 
coherent  approach  to  the  design  and  analysis  of  file 
structures.  The  relative  efficiency  of  different  file 
implementations  is  discussed  as  a  function  of  usage 
statistics.  The  fundamental  differences  between  item 
and  descriptor-organized  files  are  discussed  in  terms 
of  input-output  requirements.  The  report  concludes 
with a discussion of batching, buffering and concurrency.
INTRODUCTION


In  this  report  we  make  an  initial  attempt  at  presenting 
a  coherent  approach  to  the  design  and  analysis  of  file 
structures.  The  work  incorporates  concepts  and  techniques 
developed under Contracts AF30(602)-3324 and AF30(602)-4211
and  previously  described  in: 


(1)  Information  System  Theory  Project:  Volume  ly 

-Theory.  Anatol  W.  Holt,  et  al.  November 
1965.  AD  626-819. 

(2)  Information  System  Theory  Project,  The  Nature 
of  FFS:  An  Experiment  in  » -Theoretic  Analysis. 
The  Staff  of  Project  ISTP.  March  1966. 

(3) Information System Theory Project, Final Report.
Anatol W. Holt, et al. September 1968.
AD 676-972.


File  design  at  present  is  a  primitive  art.  Starting  with 
an  inadequate  definition  of  the  problem,  the  designer 
attempts  to  find  an  economical  representation  of  the 
problem  on  some  computing  complex.  The  relative  efficiency 
of different file implementations depends critically upon
usage  statistics,  so  that  in  the  course  of  mapping  the 
problem  into  the  computing  milieu,  all  the  usage  statistics 
are  in  effect  assigned  values  by  virtue  of  the  implementation. 
These  implicit  and  accidental  assignments  are  rarely,  if  ever, 
stated.  The  implemented  system  levies  a  cost  penalty  on  all 
deviations from the implicit statistics. The system rarely
has any capability for collecting the actual statistics
in  the  course  of  being  used;  it  generally  has  no  facility 
for  adjusting  itself  to  significant  deviations. 

In  document  retrieval  systems  the  objective  should  be  to 
perform  all  of  the  functions  required  by  the  system, 
including  storage,  update,  and  retrieval,  with  minimum 
cost.  No  meaningful  measure  of  cost  can  be  generated 
without  taking  into  consideration  the  frequency  of  occurrence 
of  each  of  the  functions  and  the  way  in  which  the  computing 
milieu  performs  these  functions.  A  knowledge  of  which 
statistics  about  the  application  are  worth  collecting,  and 
of which characteristics of the representation on the
computing  milieu  are  particularly  critical  or  sensitive  to 
the  statistics,  makes  it  possible  to  design  a  'system 
generator'  which  produces  an  initial  system  tailored  to 
what  is  known  ab  initio  and  adaptive  in  respect  to  what 
can  be  discovered  ex  post  facto. 

Consider a system consisting of the following four
subsystems:

1. A file of documents with each document uniquely
named. The functions in this subsystem include:

- retrieval (input a document name; output a
copy of the document);

- add (input a document; output a document name);

- delete (input a document name; optionally
output a copy of the document; and in any case
prevent the execution of any function which has
the document name as an input until the execution
of an add function with the document name as
output);

- update (input a document name and a set of
revisions; output a document name).

2. A file of descriptors. The functions in this
subsystem include:

- entry (input a descriptor; if not previously
encountered make up an internal representation;
output internal representation);

- query (input a descriptor; output internal
representation).

3. A file of descriptor-document relations. The
functions in this subsystem include:

- query (input a descriptor [internal representation];
output a set of document names [all of those
documents to which this descriptor applies]).

4. User/control subsystem. The functions in this
subsystem include:

- store (input a document with descriptors for
that document; output a completion signal);

- retrieve (input a list of descriptors combined
by "and" and "or" logic; output copies of those
documents which satisfy the retrieval specification).
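
The division of labour among the four subsystems can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions, not a design taken from the report: the class name `RetrievalSystem`, the generated document names (`doc1`, `doc2`, ...), the use of integers as internal representations, and the encoding of an and/or specification as a list of and-groups joined by "or" are all hypothetical. The delete and update functions are omitted for brevity, and `store` returns the new document name in place of a bare completion signal.

```python
import itertools


class RetrievalSystem:
    """Hypothetical sketch of the four subsystems described above."""

    def __init__(self):
        self.documents = {}       # subsystem 1: document name -> document
        self.descriptor_ids = {}  # subsystem 2: descriptor -> internal representation
        self.relations = {}       # subsystem 3: internal representation -> document names
        self._counter = itertools.count(1)

    def entry(self, descriptor):
        """Subsystem 2 'entry': make up an internal representation if needed."""
        return self.descriptor_ids.setdefault(descriptor, len(self.descriptor_ids))

    def store(self, document, descriptors):
        """Subsystem 4 'store': file the document and its descriptor relations."""
        name = f"doc{next(self._counter)}"
        self.documents[name] = document
        for descriptor in descriptors:
            self.relations.setdefault(self.entry(descriptor), set()).add(name)
        return name

    def retrieve(self, spec):
        """Subsystem 4 'retrieve': spec is a list of and-groups joined by 'or'."""
        names = set()
        for group in spec:
            hits = [self.relations.get(self.descriptor_ids.get(d), set())
                    for d in group]
            if hits:
                names |= set.intersection(*hits)
        return {name: self.documents[name] for name in names}


system = RetrievalSystem()
first = system.store("report on card files", ["cards", "indexing"])
second = system.store("report on needles", ["cards"])
both = system.retrieve([["cards"]])                 # every document about cards
narrow = system.retrieve([["cards", "indexing"]])   # cards AND indexing
```

Note that only `retrieve` touches the relation file heavily; this is the part of the system whose input-output volume the rest of the report sets out to measure.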

The focal point of our study will be the organization and
utilization of Subsystem 3, the file of descriptor-document
relations. We start with an abstract model of cross-indexing
(sometimes referred to as coordinate indexing).
Feature card files and edge-notched card files are discussed
in relation to this abstraction. Indirect and superimposed
coding techniques are explained in this context, and their
efficiency is related to usage statistics and hardware
characteristics. We then evolve a theoretical measure of
the input-output requirements of Subsystem 3, which we
characterize as the volume of cross-indexing information.
The fundamental differences between item- and descriptor-
organized files are discussed in terms of this measure.
Formulae are derived for calculating the average volume
transacted with, as a function of usage statistics and
technique of file organization.

The report then translates the previously developed concepts
into the context of computer implementation. A concrete
example of the application of the volumetric formulae is
provided by a study of PDQ, an information retrieval system
operational on IBM System 360 hardware. The effects of
batching (a usage statistic) on the formulae are examined.
This leads to an important asymmetry from the user's point
of view and suggests a particular design for cross-indexing
in an interactive hardware/software milieu. We
then present some alternative representational techniques
applicable in the computer context. This permits us to
derive formulae analogous to the volumetric formulae
presented previously and applicable to the three commonly
found organizations for cross-indexing files: item-
sequenced, inverted, and list-structured files.

We  then  discuss  the  problem  of  performing  list  intersection, 
a  calculation  which  is  of  critical  importance  in  inverted- 
list  file  organizations.  A  new  technique  for  this  operation 
is  developed  and  compared  with  several  existing  techniques. 
Another  adaptively  applicable  representation  of  inverted 
lists  is  discussed.  Next,  a  method  for  defining  graph 
representations  of  file  structures  is  presented.  The 
methodology  of  the  report  is  then  applied  in  a  critique 
of  'balanced  trees',  a  component  of  the  Multilist  system. 

An  alternative  'decoding'  technique  is  presented,  and  use 
of  secondary  storage  is  discussed. 

We conclude with a discussion of batching, buffering, and
concurrency, explicated by the use of Petri Net models.
Both hardware and software models are constructed, including
the National Cash Register CRAM card random access memory
device and a highly concurrent version of the abstract
cross-indexing model presented in the first section of
the report.


I. A Model of Cross-Indexing


We will use as the basis for our discussion the model of
cross-indexing in Figure I-1. This model consists of a
grid representing a file of item-descriptor relations.
The horizontal lines (rows) are labelled $I_1, I_2, \ldots, I_m$;
they correspond uniquely to the items (i.e., documents,
records, etc.) currently in the system. The vertical
lines (columns) are labelled $d_1, d_2, \ldots, d_n$; they
correspond uniquely to the descriptors. Each descriptor
applies to some subset of the items in the file, and some
subset of the descriptors applies to each item. These
relations are represented in the model by circled
intersections: each intersection in the grid is either circled
or uncircled; a given intersection j,k is circled if and
only if descriptor $d_j$ applies to item $I_k$.

A query or retrieval request consists of a list of
descriptors; the response to such a query consists of a
list of all items to which all descriptors in the query
apply. Thus, in terms of the grid model, a query consists
in the selection of some subset Q of the set of all d's.
The response consists in a readout of the members of some
subset R of the set of all I's. Any given item $I_r$ is
a member of R if and only if, for all q such that $d_q$
is a member of Q, intersection q,r is circled.
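
The grid model translates almost directly into code. The following Python sketch is one possible rendering, under the assumption that the file is held simply as a set of circled (descriptor, item) pairs; the class name `Grid` and its method names are ours and do not appear in the model itself.

```python
class Grid:
    """The cross-indexing grid, held as a set of circled intersections."""

    def __init__(self):
        self.items = set()        # horizontal lines (rows)
        self.descriptors = set()  # vertical lines (columns)
        self.circled = set()      # circled intersections (descriptor, item)

    def circle(self, descriptor, item):
        """Record that `descriptor` applies to `item`, adding lines as needed."""
        self.descriptors.add(descriptor)
        self.items.add(item)
        self.circled.add((descriptor, item))

    def uncircle(self, descriptor, item):
        """Record that `descriptor` no longer applies to `item`."""
        self.circled.discard((descriptor, item))

    def query(self, q):
        """Return R: the items for which every descriptor in Q is circled."""
        return {item for item in self.items
                if all((d, item) in self.circled for d in q)}


grid = Grid()
grid.circle("d1", "I1")
grid.circle("d2", "I1")
grid.circle("d1", "I2")
grid.circle("d3", "I3")
response = grid.query({"d1", "d2"})  # only I1 carries both descriptors
```

This direct rendering inspects every item for every query. The card systems considered next can be read as two ways of avoiding that exhaustive scan, one organized by descriptor and one by item.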
The operations in this system required for processing
queries and updating the file will include: descriptor
selection, response readout, circling of intersections,
uncircling of intersections, addition of (horizontal and/or
vertical) lines, and deletion of (horizontal and/or vertical)
lines. We will use this model as a frame of reference for
the examination and comparison of various extant indexing
techniques. We will begin by considering feature (peekaboo)
cards and edge-notched cards, which represent relatively
straightforward implementations of two basic types of
file organization.


II. Feature Cards


A  file  of  item-descriptor  relations  may  be  implemented  as 
a  deck  of  feature  cards.  In  the  most  straightforward 
implementation,  each  card  in  the  deck  corresponds  uniquely 
to one descriptor. Each card contains a grid-like array
of card-positions, each of which is either punched out
or not. (Figure II-1 shows a feature card with 400
positions; they have been numbered from left to right and
from top to bottom so that each position is uniquely named.)
Each item in the file is assigned a unique card-position,
the same one on every card in the deck. (Thus we could
assign item $I_k$ card-position k on each card in a deck.)

A  given  position  on  a  given  card  is  punched  out  if  and 
only  if  the  descriptor  represented  by  that  card  applies  to 
the  item  represented  by  that  position.  Thus  the  cards  in 
a  deck  correspond  to  the  columns  in  our  grid  model;  the 
card-positions  correspond  to  the  rows;  and  punched  out 
card-positions  correspond  to  circled  intersections. 

A  query  is  performed  by  selecting  a  subset  of  the  cards  in 
the  deck;  namely,  those  cards  which  represent  the  descriptors 
in  the  query.  The  selected  cards  are  lined  up  on  top  of  one 
another  and  placed  in  front  of  a  light  source.  The  positions 
through  which  light  is  visible  —  i.e.,  those  positions  which 
are  punched  out  on  every  card  in  the  query  set  —  identify 
the  items  which  satisfy  the  query. 
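
Stacking feature cards in front of a light source amounts to a bitwise AND over one bit-vector per descriptor. The sketch below illustrates this in Python under our own assumptions: each card is held as an integer whose bit k-1 is set when position k is punched out, the 400-position card of Figure II-1 is taken as the card size, and the names `punch` and `query` are hypothetical.

```python
POSITIONS = 400  # card-positions per feature card, as in Figure II-1


def punch(deck, descriptor, position):
    """Punch out `position` (numbered from 1) on the card for `descriptor`."""
    if not 1 <= position <= POSITIONS:
        raise ValueError("position lies outside the card")
    deck[descriptor] = deck.get(descriptor, 0) | (1 << (position - 1))


def query(deck, descriptors):
    """Stack the selected cards and list the positions that pass light."""
    light = (1 << POSITIONS) - 1       # with no card in the way, all light passes
    for descriptor in descriptors:
        light &= deck.get(descriptor, 0)  # a missing card blocks everything
    return [p + 1 for p in range(POSITIONS) if (light >> p) & 1]


deck = {}
punch(deck, "d1", 3)
punch(deck, "d1", 7)
punch(deck, "d2", 7)
punch(deck, "d2", 9)
visible = query(deck, ["d1", "d2"])  # only position 7 is open on both cards
```

The cost of a query here depends on the number of cards stacked, not on the number of items, which is the characteristic property of a descriptor-organized file.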


Note that with feature cards it is relatively easy to add
another descriptor to the system: one simply adds another
card to the deck. It is also easy to add another item to
the system, until the number of items is equal to the
number of positions on a card. At that point addition of
items to the file requires the creation of an additional
deck. New items may then be assigned card-positions in
the new deck. (If, as in Figure II-1, there are 400
positions on a card and the positions on the first deck
have been named by numbering them 1 through 400, then the
card-positions of cards in the second deck would be
numbered 401 through 800, and so forth.) Thus the number
of feature card decks, and therefore the number of
operations necessary to perform one query, will be equal
to (the number of items)/(the number of card-positions),
rounded up. Deletion of an item from the system is
relatively difficult since, although it is easy to punch
out a hole (i.e., circle an intersection), it is difficult
to fill in a hole (i.e., uncircle an intersection). Deletion
of an item might involve reproducing, without the hole in
the position representing the item to be deleted, the
feature card for each descriptor which applied to the item.
III. Edge-Notched Cards

A file of item-descriptor relations may also be implemented
as a deck of edge-notched cards. Each card in the deck
corresponds uniquely to one item in the system. Each card
contains a set of card-positions: a row of holes along
its margin. Each hole may be notched out (i.e., the
material separating the hole from the edge of the card
is cut away) or not. (Figure III-1 shows an edge-notched
card with 37 positions, numbered from left to right.) In
the most straightforward implementation, each descriptor
in the system is assigned a unique card-position, the
same one on every card in the deck. (Thus we could assign
descriptor $d_k$ card-position k on each card in a deck.)

A given position on a given card is notched out if and
only if the descriptor represented by that position applies
to the item represented by that card. Thus the cards in
a deck correspond to the rows in our grid model; the card-
positions correspond to the columns; and notched out card-
positions correspond to circled intersections.

A query is made by selecting a subset of the card positions.
The deck is lined up and a sorting needle is inserted
through each hole which represents a descriptor in the query.
The needles are then raised and jiggled so that those cards
which have every needled hole notched out drop from the
deck. The subset of cards which drop represents the subset
of items to which all descriptors in the query apply.
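
The needle sort can be sketched in the same spirit. In the hypothetical Python fragment below, each item card is held as the set of hole positions notched out on it, and a card drops when every needled hole is notched; the function name `needle_sort` and the sample deck are our own illustrations.

```python
def needle_sort(deck, needled):
    """Return the items whose cards drop when the given holes are needled.

    `deck` maps each item to the set of hole positions notched out on its
    card; a card drops only if every needled hole is notched out.
    """
    needled = set(needled)
    return [item for item, notches in deck.items() if needled <= notches]


# Direct coding: hole position k stands for descriptor d_k.
deck = {
    "I1": {1, 2},
    "I2": {1},
    "I3": {2, 3},
}
dropped = needle_sort(deck, [1, 2])  # query on d_1 and d_2

# Adding or deleting an item is a single-card operation.
deck["I4"] = {1, 2, 3}
del deck["I2"]
```

Every card in the deck is examined on every query, however few needles are used. This is the characteristic property of an item-organized file, in contrast to the feature card deck.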

Note that with edge-notched cards it is relatively easy to
add descriptors to the system until the number of descriptors
is equal to the number of positions on a card. At that
point it becomes difficult to add further descriptors.
Starting another deck will not entirely solve the problem
since coordinated searches may not be satisfied in either
deck by itself.^ It is easy to add another item to the
system, by simply adding another card to the deck.
However, once the number of cards (i.e., items) exceeds the
capacity of a sorting needle, the sort operation can no
longer be performed in a single step. The number of
operations necessary to perform one query will be equal
to (the number of items)/(the capacity of a sorting
needle), rounded up. This is, of course, similar to the
formula for the number of operations necessary to perform
a query with feature cards: the capacity of a sorting
needle corresponds to the number of positions on a feature
card. It is, of course, easy to delete an item from the
system, in contrast to feature cards. Finally, it is easy
to circle an intersection (i.e., notch out a hole) but
difficult to uncircle an intersection.
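
Both operation counts are the same ceiling division, which a short Python sketch makes explicit. The function name is ours, and the needle capacity of 200 cards used in the example is purely hypothetical; only the 400-position feature card comes from the text.

```python
def operations_per_query(items, capacity):
    """Ceiling of items/capacity: decks to stack, or needle passes to make."""
    if items < 0 or capacity <= 0:
        raise ValueError("items must be >= 0 and capacity must be > 0")
    return -(-items // capacity)  # integer ceiling division


# Feature cards with 400 positions per card:
decks_for_401_items = operations_per_query(401, 400)

# Edge-notched cards with a hypothetical needle capacity of 200 cards:
passes_for_401_items = operations_per_query(401, 200)
```

In both systems the count grows linearly with the number of items; what differs is the constant, which is fixed by the card format in one case and by the needle in the other.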


^An overflow technique is described in ISTP Edge-
Notched Card System (A Manual for the Information System
Theory Project). Holt, Anatol W. Applied Data Research, Inc.
Princeton, N.J. February 1964.


Figure III-1
IV. Indirect Coding


The method of encodement described in the preceding section,
in which a card position uniquely represents a single
descriptor, is called direct coding. The mechanics of
edge-notched sorting and the physical properties of
materials for cards, needles, and so forth, set a practical
limit to the number of hole positions that can be provided
on a margin. (There are a number of different edge-notched
card formats available, and the number of positions on a
card varies from several dozen to several hundred.) This
means that if direct coding is used, the capacity of the
card will quickly be exceeded (i.e., the number of
descriptors will be greater than the number of card-positions)
in all but the most primitive systems. We have already
noted that the cost penalty is considerable. Indirect
coding is a type of descriptor-encodement which avoids
this cost penalty by using a relatively small number of
card-positions to represent a relatively large number of
descriptors.

The fundamental statistical assumption which justifies
indirect coding is the following: although there may be a
very large number of different descriptors in a data base,
only a very small subset of these descriptors will apply
to any one item in the data base. In terms of our grid
model, the number of circled intersections on any one
horizontal line will be very small relative to the total
number of descriptors in the system.

In  order  to  consider  the  most  straightforward  type  of 
indirect  coding,  let  us  make  a  further  assumption:  the 
set  of  descriptors  in  the  system  contains  subsets  of 
mutually  exclusive  descriptors  —  that  is,  no  two  members 
of  the  same  subset  ever  apply  to  the  same  item.  (For 
example,  suppose  that  the  system  is  a  personnel  file  and 
each  item  represents  an  individual.  Since  each  human  being 
has  exactly  one  birthdate,  the  age  values  would  form  a 
subset  of  mutually  exclusive  descriptors.)  Indirect 
coding  can  then  be  utilized  as  follows:  the  card-positions 
are  partitioned  into  subsets.  Each  such  subset  of  card- 
positions  is  used  to  encode  one  set  of  mutually  exclusive 
descriptors. 

Suppose, for example, that we wish to represent a set of
six mutually exclusive descriptors, which we can name
$d_1, d_2, \ldots, d_6$, with four hole positions, named $h_1$, $h_2$,
$h_3$, and $h_4$. We might encode the descriptors as follows:

    $d_1$: notch out $h_1$ and $h_2$
    $d_2$: notch out $h_1$ and $h_3$
    $d_3$: notch out $h_1$ and $h_4$
    $d_4$: notch out $h_2$ and $h_3$
    $d_5$: notch out $h_2$ and $h_4$
    $d_6$: notch out $h_3$ and $h_4$

If $d_1$ occurred in a query, holes $h_1$ and $h_2$ would be
needled; if $d_2$ occurred, $h_1$ and $h_3$ would be needled,
and so forth. The capacity C of a subset of card-
positions (i.e., the number of mutually exclusive descriptors
encodable in a subset of the holes) is determined by the
number of hole positions H in the subset and the number
of hole positions P used to encode each of the descriptors.

$$C = \frac{H(H-1)(H-2)\cdots(H-P+1)}{P!} = \binom{H}{P}$$

C  is  a  maximum  for  P  =  H/2,  rounded  up  or  down. 
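
The code assignment and the capacity formula can both be checked mechanically. The Python sketch below enumerates the P-hole combinations of an H-hole field in lexicographic order and hands them out to the descriptors in turn; the function name `assign_codes` is ours, and assigning codes in this particular order is an assumption made for illustration (any one-to-one assignment would serve).

```python
from itertools import combinations
from math import comb


def assign_codes(descriptors, holes, p):
    """Give each descriptor a distinct set of p holes from one field."""
    codes = list(combinations(holes, p))
    if len(descriptors) > len(codes):
        raise ValueError("field capacity C(H, P) exceeded")
    return dict(zip(descriptors, codes))


holes = ["h1", "h2", "h3", "h4"]
descriptors = ["d1", "d2", "d3", "d4", "d5", "d6"]
codes = assign_codes(descriptors, holes, 2)

# Capacity of a field of H holes for each choice of P, here with H = 10.
capacity_by_p = {p: comb(10, p) for p in range(11)}
best_p = max(capacity_by_p, key=capacity_by_p.get)
```

With four holes taken two at a time there are exactly six codes, so the six descriptors fill the field completely; a seventh mutually exclusive descriptor would require a wider field.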

This  corresponds  to  a  statement  in  information  theory:  the 
encodement  which  makes  the  most  efficient  use  of  a  medium  is 
an  encodement  which  results  in  a  probability  of  .5  that 
the  value  of  any  given  bit  will  be  1.  With  edge-notched 
cards  the  probability  of  any  given  hole  being  notched  out 
is optimally .5.

Indirect coding greatly increases the total number of
descriptors which can be represented. For example, if we
were using an edge-notched card format with 100 holes,
direct coding would limit the system to 100 descriptors.
However, if we could group the descriptors into 10 subsets
each of whose members were mutually exclusive, then we
might represent with the same card format

$10 \times C = 10 \times 252 = 2520$ descriptors (for $H = 10$ and $P = 5$)

in the system.
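
The arithmetic of this example is easy to reproduce. The sketch below assumes, as the example does, that the card is divided into equal fields and that each field uses P = H/2 (rounded down) holes per descriptor; the function name `indirect_capacity` is hypothetical.

```python
from math import comb


def indirect_capacity(total_holes, fields):
    """Descriptors representable when a card is split into equal fields.

    Each field of H holes encodes one subset of mutually exclusive
    descriptors using P = H // 2 holes per descriptor.
    """
    holes_per_field = total_holes // fields
    return fields * comb(holes_per_field, holes_per_field // 2)


direct = 100                           # one hole per descriptor
indirect = indirect_capacity(100, 10)  # ten fields of ten holes each
```

The gain is available only to the extent that the descriptors really do fall into ten mutually exclusive subsets of about 252 members each, a condition on the data base rather than on the card.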

This type of indirect coding obviously depends upon the
possibility of grouping the descriptors into subsets
whose members are mutually exclusive. The effectiveness
of the grouping is a function of the number of descriptors
in each subset. In the worst case no two descriptors in
the data base would be mutually exclusive. Then we would
in effect be forced back on direct coding: since the
number of subsets would have to equal the number of
descriptors, there would only be one descriptor in each group,
and consequently we would need one hole position for each
descriptor. In fact, in a typical data base some descriptors
can be usefully grouped into such subsets while others
cannot. In the next section we will discuss another type
of indirect coding, superimposed coding, which is
useful in such situations but which introduces another
cost component.

Another difficulty frequently arises in the use of subsets
of mutually exclusive descriptors. We often do not have
enough information about the data base to group the
descriptors ab initio. Nevertheless, with edge-notched
cards we must allocate holes when the system is created,
and this cannot be done without deciding the number of
subsets and the capacity of each. Furthermore, regrouping
the descriptors ex post facto requires statistical
calculations, the notching out of additional holes, and,
worst of all, the filling in of notches. The only
alternative to filling in notches is to replace every
affected item card with a new, appropriately punched
card, which, of course, will involve redundant notching.

Indirect  coding  can  be  employed  in  our  grid  model  as 
follows:  vertical  lines  no  longer  correspond  one-one  to 
descriptors;  instead,  the  vertical  lines  are  partitioned 
into  subsets,  each  of  which  corresponds  to  a  set  of 
mutually  exclusive  descriptors,  and  so  forth.  This 
immediately  suggests  that  we  can  utilize  indirect  coding 
with  feature  cards  as  well.  Since  vertical  lines  in  the 
grid  model  correspond  to  cards  in  a  feature  card  deck,  we 
would  proceed  as  follows:  instead  of  representing  each 
descriptor  with  one  card,  we  partition  the  deck  into  sub¬ 
sets  of  cards,  letting  each  such  subset  correspond  to  one 
subset  of  mutually  exclusive  descriptors,  and  so  forth. 

In an edge-notched card system, indirect coding allows
us to represent many more descriptors with a given number
of hole positions. Analogously, indirect coding allows us
to represent many more descriptors with a given number of
feature cards. In an edge-notched system we are in effect
forced to use indirect coding since it is impossible to
increase the number of hole positions; whereas in a
feature card system, indirect coding is used in order to
reduce the number of cards required. In both systems the
density of circled intersections (i.e., hole punching or
notching) is increased. This density approaches an
optimum at .5.

Indirect  coding  results  in  an  increase  in  the  number  of 
circled  intersections  (because  each  instance  of  a 
descriptor  applying  to  an  item  may  now  be  represented  by 
several circled intersections instead of only one). Thus
in  both  edge-notched  card  and  feature  card  systems  we  must 
take  into  account  the  increased  cost  of  punching  or 
notching  each  time  a  new  item  is  added.  Furthermore,  when 
a  query  is  performed,  many  more  vertical  lines  must  be 
selected:  in  an  edge-notched  system  more  needles  must  be 
inserted  into  the  deck;  in  a  feature  card  system  a  larger 
number  of  cards  must  be  selected  and  lined  up  in  front  of 
the  light  source.  In  other  words,  indirect  coding  also 
leads  to  certain  cost  increases  for  retrieval,  and  these 
too must be taken into account. (Later, when we consider
computerized implementations, relative costs will vary
significantly from both feature and edge-notched systems,
a fact which will help explain why techniques effective in
one hardware milieu are totally inappropriate in another
hardware milieu.)

The  reduction  in  the  number  of  vertical  lines  which  is 
made  possible  by  indirect  coding  has  further  cost  implications. 
By  reducing  the  number  of  cards  in  a  feature  card  deck,  we 
reduce the time required to locate any one card (for example,
when  performing  a  query).  Analogously,  by  reducing  the 
number  of  hole  positions  used  in  an  edge-notched  system,  we 
reduce  the  time  necessary  to  locate  any  one  particular  hole. 
This  cost  reduction  will  be  even  more  pronounced  in  the 
computer  context.  In  general,  any  reduction  in  memory  bulk 
reduces  the  amount  of  time  required  to  locate  a  particular 
entry.  Therefore,  indirect  coding  will  increase  the  number 
of  vertical  lines  to  be  selected  (i.e.,  the  number  of  cards 
to  be  selected  in  a  feature  card  system  or  the  number  of 
holes  to  be  needled  in  an  edge-notched  system)  in  performing 
a  query,  but  it  will  also  reduce  the  cost  of  locating  a 
particular  vertical  line  (feature  card,  hole)  by  reducing 
the  total  number  of  vertical  lines. 

Feature cards are inherently more flexible than edge-
notched cards when indirect coding is introduced ex post facto.
The operation is one of replacing a large number of
feature cards which are pairwise mutually exclusive (i.e.,
no two cards have any punched out card-position in common)
with a much smaller set of cards. (If n is the number
of cards required to represent the set of mutually exclusive
descriptors by direct coding, then n', the number of
cards necessary with indirect coding, is the smallest
integer such that [expression lost in the source] is larger
than n.) All feature cards not included in the subset are
unaffected.

In  an  edge-notched  system,  however,  item  cards  to  be 
replaced  may  have  other  descriptors  associated  with  them, 
and  the  notches  for  these  descriptors  must  be  replicated 
on  the  replacement  cards.  In  other  words,  the  introduction 
of  indirect  coding  ex  post  facto  will  require  more  circling 
and  uncircling  operations  in  an  edge-notched  system. 




V.  Superimposed  Coding 

Superimposed  coding  is  another  type  of  indirect  descriptor 
encodement  which  allows  the  system  to  have  many  more 
descriptors than there are vertical lines. Again it
should be true that, although there may be a very large
number of descriptors in the data base, only a very
small number of the descriptors will apply to any
one  item.  In  another  respect,  however,  superimposed 
coding  differs  radically  from  the  indirect  coding  method 
described  in  the  preceding  section:  instead  of  working 
most  effectively  when  descriptors  can  be  grouped  into 
subsets  whose  members  are  mutually  exclusive,  superimposed 
coding  is  most  effective  when  there  is  no  correlation 
whatever  between  descriptors  —  that  is,  when  probabilities 
of  co-occurrence  are  entirely  random.  Superimposed  coding 
is  therefore  advantageous  when  nothing  is  known  about  the 
descriptor  population  ab  initio  and  when  the  cost  of 
grouping  descriptors  ex  post  facto  is  high.  Superimposed 
coding  is  clearly  advantageous  when  it  is  known  of  the 
descriptor  population  that  co-occurrence  probabilities 
are  random. 

Let us first apply superimposed coding to our grid model.
We will view all the vertical lines as a single field. Each
descriptor is encoded by a small number of vertical lines
chosen at random, but such that no two descriptors have the
same code. Suppose, for example, that some descriptor
$d_i$ is encoded by vertical lines j, k, and l. If $d_i$
applies to some item $I_h$, then intersections (j,h),
(k,h), and (l,h) will be circled.

In an edge-notched implementation, then, all the holes on
a card are regarded as a single field. Each descriptor is
encoded by notching a small number of hole positions, chosen
at random but such that no two descriptors have the same
code. Note, however, that two different descriptor codes
may have one or more hole positions in common. Moreover,
since we have made no assumptions about exclusivity, two
descriptor codes may overlap on the same item card.
Consequently, superimposed coding creates the possibility of
false drops when queries are performed. Suppose, for
example, that descriptor $d_1$ has been encoded by notches
at hole positions $h_a$ and $h_{10}$, $d_2$ by notches at $h_b$
and $h_c$, and $d_3$ by notches at $h_c$ and $h_{10}$. Suppose
further that $d_1$ and $d_2$ both apply to item $I_r$ but that
$d_3$ does not apply to $I_r$. We now perform a query
containing only the descriptor $d_3$. Needles will be
inserted through holes $h_c$ and $h_{10}$, and the card
representing item $I_r$ will be among those cards which drop
from the deck, despite the fact that $d_3$ does not apply
to $I_r$.
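
The false drop can be reproduced in a few lines of Python. The hole numbers used below (1, 4, 7 and 10) are hypothetical values chosen only so that the three codes overlap in the way the example requires, and the function names are ours.

```python
# Hypothetical superimposed codes: two hole positions per descriptor.
codes = {
    "d1": {1, 10},
    "d2": {4, 7},
    "d3": {7, 10},  # shares hole 7 with d2 and hole 10 with d1
}


def notch_card(descriptors):
    """Superimpose the codes of every descriptor applying to one item."""
    notches = set()
    for descriptor in descriptors:
        notches |= codes[descriptor]
    return notches


def drops(card_notches, query):
    """A card drops when every hole needled for the query is notched out."""
    needled = notch_card(query)
    return needled <= card_notches


card = notch_card(["d1", "d2"])   # the item carries d1 and d2 but not d3
false_drop = drops(card, ["d3"])  # yet the card drops for a query on d3
```

The card drops because its notch pattern has lost the record of which descriptors contributed which notches; only a separate list of the item's descriptors can expose the drop as false.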
False drops may be limited by controlling certain statistics
when an edge-notched system is designed. The critical
factors are: the anticipated maximum number of items in
the system, the anticipated number of descriptors applicable
to each item, the anticipated number of descriptors in each
query, and the cost of handling false drops. An optimal
design will guarantee that the average number of notched
hole positions per item is less than half the total number
of hole positions (again, the probability that a given
hole position is notched optimally approaches 0.5). To
satisfy this requirement the following should hold:

$$\bar{d}_I \times P < 0.69\,H ,$$

where $\bar{d}_I$ = the average number of descriptors per item,
P = the number of hole positions used to encode a descriptor,
and H = the total number of hole positions. Where this is
satisfied, the false drop rate will be less than $0.5^q$,
where q = the number of hole positions specified by a query.*
Deviation from these statistics will of course degrade the
performance of such a system. If the estimate of the number
of descriptors per item is too low, the false drop rate will
be higher. (An item card to which too many descriptors
apply will become a "problem card": it will have so many
holes notched out that it will drop out for virtually every
query.) This situation can be dealt with by using overflow
techniques.† If, on the other hand, the estimate is
too high, punching density will be low and space will have
been wasted. If the number of descriptors in a typical
request is lower than estimated, the false drop rate will
increase. It will, of course, also increase if the
co-occurrence probabilities for descriptors are not random.

*See Calvin Mooers, "Zatocoding for Punched Cards",
Zator Technical Bulletin 30, 1950, pp. 14-19.
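
The design rule above can be checked numerically. If codes are chosen at random, the expected fraction of notched positions on a card is 1 - (1 - P/H)^d, which is roughly 1 - e^(-dP/H); a density of one half then corresponds to dP/H = ln 2, or about 0.69. The Python sketch below is an illustration under that independence assumption, not a calculation from the report; the function names and the sample figures for H, P, d and q are hypothetical.

```python
def notch_density(d_avg, P, H):
    """Expected fraction of notched positions when d_avg random codes,
    each of P distinct positions out of H, are superimposed on one card."""
    return 1.0 - (1.0 - P / H) ** d_avg

def false_drop_bound(density, q):
    """Approximate chance that a non-matching card has all q query
    positions notched, treating positions as independent."""
    return density ** q

H, P, d_avg = 40, 4, 6                  # hypothetical design: 6 * 4 = 24 < 0.69 * 40
density = notch_density(d_avg, P, H)
rate = false_drop_bound(density, q=8)   # e.g. a query naming two descriptors
print(round(density, 3), rate < 0.5 ** 8)
```

Because the exponential form is only an approximation, a design sitting exactly on the 0.69 boundary can come out slightly above one half; the rule is best read as a guide rather than a guarantee.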

It  is  extremely  important  to  assess  correctly  the  cost  of 
handling  false  drops.  In  an  edge-notched  system  each 
item card will presumably contain a list of the descriptors
which  apply  to  it.  The  human  user  can  therefore  easily 
sort  out  the  false  drops  at  a  cost  which  may  be  insignificant 
relative  to  the  total  cost  of  performing  a  retrieval.  False 
drops  in  other  systems,  such  as  feature  card  or  computerized 
systems,  may  be  much  more  expensive  to  handle.  Let  us 
consider  the  problem  in  the  context  of  feature  card  systems. 
It  is  clear  that  we  can  translate  a  grid  model  system 
implemented  with  superimposed  coding  into  an  equivalent 
feature  card  system.  Each  descriptor  then  corresponds  to 
some  small  subset  of  feature  cards  in  a  deck;  the  subsets 
are chosen randomly but in such a way that no two descriptors
are represented by the same subset, and so forth. The
same statistical criteria for optimal system design will
still apply, but the cost factors may be quite different.

†Holt, op. cit.

For example, a "false drop" will mean that when the set of
query  cards  is  placed  in  front  of  the  light  source,  light 
shines  through  some  card  position  representing  an  item  which 
in  fact  does  not  satisfy  the  query.  With  edge-notched  cards 
the  descriptors  could  be  listed  on  each  item  card,  but  with 
feature  cards  the  medium  does  not  permit  us  to  write  a  list 
of  applicable  descriptors  at  each  card  position.  As  a 
result,  false  drops  cannot  be  detected  without  going  to 
some  other  file  —  perhaps  to  the  items  themselves.  (The 
National  Bureau  of  Standards'  microcite  system  succeeds  in 
avoiding  this  problem.)  This  might  mean  that  false  drops 
would  not  be  detected  until  completion  of  some  relatively 
expensive  operation  whose  cost  is  a  linear  function  of  the 
total  number  of  drops,  including  false  drops.  In  some 
systems  this  might  be  the  predominant  cost  factor. 


VI.  Combined  Coding  Techniques 


In  the  preceding  sections  three  coding  methods  have  been 
discussed:  direct  coding,  indirect  coding  of  sets  of 
mutually  exclusive  descriptors,  and  superimposed  coding. 

These three techniques are not mutually exclusive, and in
fact, systems have been designed which employ all three.*
Descriptors which apply to a large fraction of the data
base should be coded directly. (For example, if a system
included 50 descriptors each of which applied to approximately
half of the items, they would co-occur frequently, and on the
average 25 of them would apply to each item; consequently any
multiple-circling encodement would be extremely inefficient
for these descriptors.) Sets of descriptors which are
mutually exclusive should be appropriately grouped and
coded indirectly. Descriptors which apply to only a small
fraction of the data base and whose co-occurrence probabilities
are random should be encoded with superimposed coding,
provided that the cost of handling false drops is low enough.

*See, for example, Holt, op. cit.


VII. Is Retrieval Time a Linear Function
of the Size of the Data Base?

An  error  frequently  encountered  in  discussions  of  cross- 
indexed  retrieval  operations  is  the  claim  that  the  time 
required  to  perform  a  search  in  a  cross-indexed  system 
does  not  increase  linearly  with  the  number  of  items  in  the 
data base, whereas in systems which perform an
item-sequenced serial search, time is manifestly a linear
function of the number of items in the data base, so that:

$$T = C \times I ,$$

where  T  =  the  total  search  time  per  query,  C  is  some 
constant,  and  I  =  the  number  of  items  in  the  system.  In 
fact,  in  both  edge-notched  and  feature  card  systems,  search 
time  is  a  step  function  of  the  number  of  items  in  the  system. 
We  have  already  expressed  this  fact  in  the  formulae  for  the 
number  of  operations  necessary  to  perform  a  query  in  the  two 
types  of  system.  The  critical  factors  were  respectively  the 
capacity  of  a  sorting  needle  and  the  number  of  positions  on 
a  feature  card.  Thus  search  time  in  these  systems  is  in 
fact a step function approximation to $T = C \times I$. Of
course, if the capacity of the system is restricted to lie
within the first step, it will appear that the time required
to perform a retrieval is independent of the number of items
in the data base. Naturally, the constant C and the number
of items covered by a step may vary radically from system
to system.
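
The step behaviour can be sketched as follows. This is an illustrative Python model only: it assumes each pass (one needle pass, or one feature card) handles a fixed number of items at a fixed cost, and the names `search_time`, `CAPACITY`, and `COST` and their values are hypothetical.

```python
import math

def search_time(items, capacity, cost_per_pass):
    """Search time when each pass handles at most `capacity` items."""
    passes = math.ceil(items / capacity)
    return passes * cost_per_pass

CAPACITY = 500     # hypothetical: cards per needle pass, or positions per feature card
COST = 2.0         # hypothetical cost of one pass

times = [search_time(n, CAPACITY, COST) for n in (100, 500, 501, 1000, 1500)]
print(times)
```

Within the first step (up to 500 items here) the time does not change, which is what makes retrieval appear independent of the size of the data base.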

An  unlikely  exception  to  the  above  remarks  would  be  the 
following:  a  continually  growing  data  base  in  which  the 
rate  of  introduction  of  new  descriptors  remains  constant. 
Under  such  circumstances  the  average  number  of  items  to 
which  a  descriptor  applies  would  not  grow  proportionally 
with  the  data  base.  An  inverted  (or  list-structured)  file 
organization  implemented  on  a  computer  might  take  advantage 
of  such  a  situation.  Even  then,  eventually  a  step  function 
would  emerge,  but  the  number  of  items  covered  by  the  first 
step  might  be  enormous. 

One  can,  of  course,  prevent  retrieval  time  from  increasing 
proportionally  with  the  size  of  the  data  base  simply  by 
adding processing capability (for instance, another
person or computer).


VIII. The Volume of Cross-Indexing Information

A  cross-indexing  retrieval  operation  involves  a  number  of 
selections  in  a  hierarchical  structure  and  the  sensing, 
manipulation,  and  transmission  of  a  volume  of  bits. 
Regardless  of  the  particular  file-structuring  employed, 
this  volume  of  bits  remains  constant  for  a  given  data  base. 
That  is,  the  raw  information  content  of  the  volume  is 
simply  the  total  set  of  item-descriptor  relations.  There 
are  many  ways  in  which  this  material  may  be  organized  with 
consequent  cost  variations.  Some  methods  of  organization, 
when  combined  with  relevant  usage  statistics,  permit 
further  interesting  variations  on  the  manner  of  storage  of 
this  volume  of  bits  —  which  also  result  in  cost  variations. 
Note  that  the  amount  of  space  used  in  the  host  system  may 
vary,  but  the  volume  of  bits  representing  cross-indexing 
information  must  remain  constant. 

With  edge-notched  card  systems  the  volume  of  bits  is 
represented  by  the  entire  card  deck.  Since  each  retrieval 
involves  manipulation  of  the  whole  deck,  edge-notched  card 
retrieval  requires  transmission  of  the  total  volume  of 
cross-indexing  information.  With  feature  cards  the  volume 
of  bits  is  once  again  represented  by  the  entire  card  deck, 
but  each  retrieval  involves  manipulation  of  only  a  subset 
of  the  deck;  namely,  those  feature  cards  which  formulate 
the query. In another sense, however, the entire feature
card deck is involved in retrieval: the selection of a
particular  feature  card  requires  choosing  from  the  total 
deck  the  desired  card.  If  the  feature  cards  were  organized 
randomly  with  no  access  method  other  than  linear  scan,  we 
would  have  to  look  at  the  labels  on  approximately  half  the 
feature  cards  to  locate  any  particular  card.  However,  the 
selection  of  a  feature  card  is,  as  has  been  shown  previously, 
analogous  to  the  selection  of  a  hole  position  in  an  edge- 
notched  deck. 

To  compare  usefully  the  volume  of  bits  transacted  with  in 
the  two  systems,  we  will  therefore  view  the  performance  of 
a  query  as  a  two-stage  process:  (1)  the  selection  of  the 
subset of vertical lines (in the grid model) which represents
the  query;  (2)  the  selection  of  the  subset  of  horizontal 
lines  which  satisfy  the  query  (i.e.,  those  lines  which  are 
circled  at  every  intersection  with  a  member  of  the  query 
set).  With  respect  to  the  first  stage  —  selection  of  a 
subset  of  the  cards  in  the  feature  card  system,  or  selection 
of  a  subset  of  the  hole  positions  in  the  edge-notched 
system  —  the  two  systems  are  equivalent.  With  respect  to 
the  second  stage,  however,  they  are  not:  in  the  edge- 
notched  system  we  must  still  transact  with  the  total 
volume  of  bits  (i.e.,  with  the  entire  deck);  in  the  feature 
card  system  we  need  only  deal  with  a  subset  of  those  bits 
(i.e., with the query cards).


IX. A Fundamental Difference between Item-
and Descriptor-Organized Files


Let  us  expand  our  comparison  of  edge-notched  card  systems 
and  feature  card  systems  to  a  more  general  comparison  of 
two  basic  types  of  cross-indexed  file  organization.  Those 
files  which,  like  edge-notched  systems,  are  sequenced  by 
item  we  call  item-organized  files;  files  which,  like 
feature  card  systems,  are  sequenced  by  descriptor  we  call 
descriptor-organized  (or  inverted)  files. 

In  general,  to  perform  a  retrieval  in  an  item-organized 
file  the  entire  volume  of  item-descriptor  information  must 
be  transacted  with;  whereas  in  a  descriptor-organized  file 
only  a  subvolume  of  the  item-descriptor  information  need 
be  transacted  with.  Needless  to  say,  the  cost  significance 
of  this  fact  is  highly  dependent  upon  the  interaction 
between  usage  statistics,  hardware  characteristics,  and 
representational  technique.  For  the  sake  of  illustration, 
consider  the  following  possibilities: 

(1)  A  retrieval  request  which  includes  a  large  proportion 
of  the  descriptor  population.  In  this  situation  the 
volumetric  reduction  is  negligible.  Fortunately,  for  a 
large  spectrum  of  information  retrieval  problems,  only  a 
tiny  fraction  of  the  descriptor  set  applies  to  any  one  item. 
Hence a retrieval request which included a large number of
descriptors would not be satisfied by any item in the
collection. Therefore, within this spectrum retrieval
requests will, on the average, include only a small fraction
of  the  descriptor  population.  However,  when  in  the  sequel 
we  discuss  batching,  this  situation  will  once  again  be  of 
interest. 

(2) A retrieval request which is satisfied by a large
proportion of the items. In this situation the volumetric
reduction in the cross-indexing operation is significant,
but the size of the response implies that at some stage
we will be dealing with most of the items anyway. Under
such circumstances that aspect of the indexing operation
which involves the descriptor-item pairs is likely to be
almost insignificant in terms of the total cost of the
operation, and other factors, which may or may not depend
upon the inverted organization, may dominate the cost.

Once  again,  for  a  large  spectrum  of  information  retrieval 
problems  (but  not  quite  so  large  as  that  cited  in  the 
previous  paragraph)  the  number  of  items  satisfying  a 
request  will,  on  the  average,  be  quite  small  relative  to 
the  total  item  population.  However,  when  in  the  sequel  we 
discuss  batching,  this  situation  too  will  once  again  be 
of interest.


X. A Second Fundamental Difference between
Item- and Descriptor-Organized Files

Given  the  symmetry  between  items  and  descriptors  suggested 
by  our  grid  model,  the  fact  that  the  inverted  file 
organization  facilitates  retrieval  would  imply  that  some 
other  operation  should  be  facilitated  in  an  item-organized 
file.  Queries  are  stated  in  terms  of  descriptors  and 
partition  the  indexing  information  by  descriptor.  Loosely 
speaking,  that  is  why  the  inverted  organization  is  potentially 
advantageous  in  performing  retrievals.  Additions  and 
deletions  of  items,  on  the  other  hand,  group  the  indexing 
information  by  item;  here  we  can  expect  some  advantage  from 
an  item-organized  file. 

What  is  at  issue  is  the  volume  of  bits  that  must  be 
transacted  with.  In  an  item-organized  file,  to  add  or 
delete  an  item  (i.e.,  to  perform  an  update)  we  must 
manipulate a volume which is 1/I of the total volume
(where I = the total number of items in the system). In
an inverted organization, we must in some sense manipulate
d/D of the total volume (where d = the number of
descriptors that apply to the item, and D = the total number
of descriptors in the system). Since for a wide spectrum of
systems the number of items is much larger than the number
of descriptors, and d is greater than or equal to one, an
update in this spectrum involves a smaller subvolume of
the item-descriptor information if the file organization
is by item rather than by descriptor.

Therefore, in the absence of batching (which we will discuss
in the sequel), and without introducing the critical
effects of usage statistics and hardware characteristics
(except on a very general level), we can make some
comparative remarks about the volume of cross-indexing
information which must be handled to perform retrievals
and updates in the two types of file organization.


XI. Formulae for the Volume of Bits Transacted with


(a)  $V = I \times D$

     where V is the total volume in bits
           I is the number of items
           D is the number of descriptors

(b)  $V_{rD} = (\bar{d}_r / D)\,V = \bar{d}_r \times I$

     where $V_{rD}$ is the average subvolume for retrieval in
             a descriptor-organized file
           $\bar{d}_r$ is the average number of descriptors in
             a retrieval request

(c)  $V_{rI} = V$

     where $V_{rI}$ is the average subvolume for retrieval in
             an item-organized file

(d)  $V_{uD} = (\bar{d}_I / D)\,V = \bar{d}_I \times I$

     where $V_{uD}$ is the average subvolume for update in
             a descriptor-organized file
           $\bar{d}_I$ is the average number of descriptors for
             an item

(e)  $V_{uI} = (1/I)\,V = D$

     where $V_{uI}$ is the average subvolume for update in
             an item-organized file
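
As a rough illustration, formulae (a) through (e) can be evaluated directly. The Python sketch below takes the retrieval subvolume of an item-organized file to be the whole volume, as stated in Section IX; the function name `volumes` and the sample figures are hypothetical and are not taken from the report.

```python
def volumes(I, D, d_r, d_I):
    """Return the Section XI volumes in bits for I items and D descriptors.

    d_r: average number of descriptors in a retrieval request
    d_I: average number of descriptors applying to an item
    """
    V = I * D
    return {
        "V": V,              # (a) total volume
        "V_rD": d_r * I,     # (b) equals (d_r / D) * V
        "V_rI": V,           # (c) the whole file is scanned
        "V_uD": d_I * I,     # (d) equals (d_I / D) * V
        "V_uI": D,           # (e) equals (1 / I) * V
    }

v = volumes(I=10_000, D=200, d_r=3, d_I=10)   # hypothetical figures
print(v)
```

With these figures a retrieval touches far fewer bits in the descriptor-organized file, while an update touches far fewer in the item-organized file, which is the asymmetry discussed in Sections IX and X.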


XII. Some Comments on the Volumetric Formulae

These  results  must  be  interpreted  with  care,  since  hardware 
costs  have  not  been  introduced.  For  instance,  when  updating 
a  feature  card  it  is  perfectly  true  that  a  large  number  of 
bits  are  being  dealt  with  (the  set  of  positions  on  a 
feature card) but the cheapness and density of the medium
may make this operation inexpensive. The relative volumes
are not real cost factors, but only a theoretical measure
of the amount of information to be transacted with. These
results become more interesting when we move from
edge-notched card systems and feature card systems (two
different hardware milieus, with different costs for
transacting with the same volume of information) to a
digital computer capable of employing either file
organization and using the same hardware to transact with
the bit volume, so that the theoretical measure is converted
into a useful comparison.


XIII. Computer Implementations of
the Cross-Indexing Operation

At  this  point  it  should  be  intuitively  clear  that  the 
grid model provides a general framework for discussing
cross-indexing.  We  have  so  far  discussed  two  general 
classes  of  extant  systems,  based  on  feature  cards  and 
edge-notched  cards.  Both  of  these  use  cross-indexing, 
and  we  have  established  the  relationship  between  each  of 
them  and  the  grid  model.  We  will  now  do  the  same  for  a 
third  class  of  systems,  based  on  the  use  of  digital 
computers.  Because  the  characteristics  of  stored-program 
digital  computers  permit  flexible  utilization  of  hardware 
resources,  the  number  of  significantly  different  and 
interesting  representational  techniques  for  cross-indexing 
is  far  greater  than  in  either  edge-notched  card  or  feature 
card  systems,  where  a  combination  of  direct,  indirect,  and 
superimposed coding techniques virtually exhausts the
possibilities. Hence our discussion of computer-based
techniques is of necessity incomplete; we will concentrate
on highlighting the similarities and differences between
a few computer techniques and the systems already discussed.


XIV. Computer-Implemented Inverted
File Organization

A  straightforward  computer  implementation  of  an  inverted 
file  would  consist  of  a  set  of  lists,  one  for  each 
descriptor  in  the  system.  There  are  a  number  of  possible 
organizational  schemes  for  the  lists  themselves,  the 
following being the simplest: the bits in each list are
numbered sequentially; bit j in any given list refers
to item $I_j$. Hence each list is exactly I bits long
(where I = the number of items in the system). Each bit
has a value of either 0 or 1. A given bit in a
given  list  is  a  1  if  and  only  if  the  descriptor  associated 
with  that  list  applies  to  the  item  associated  with  that 
bit  position  in  the  list.  The  lists,  then,  correspond  to 
the  vertical  lines  in  our  grid  model;  bit  positions  in  the 
lists  correspond  to  horizontal  lines;  and  bits  with  value 
1  correspond  to  circled  intersections. 

A  retrieval  request  is  made  by  selecting  a  subset  of  the 
lists  and  intersecting  them.  Specifically,  the  lists  which 
correspond  to  the  descriptors  in  the  query  are  selected 
and  from  these  a  response  list  is  calculated  as  follows: 
a given bit j in the response list has value 1 if and
only if bit j of every query list has value 1. The
bits with value 1 in the response list identify the
items which satisfy the query. Addition of a descriptor
to the system is accomplished by creating a new list.
The addition of an item requires lengthening every list
by one bit. An intersection is circled by setting a bit
to 1; an intersection is uncircled by setting a bit to
0. The cost of performing these operations may vary
enormously as a function of characteristics of the memory
medium. If the lists are stored in a bit-addressable
ferrite core memory, the cost of changing the value of a
bit will be trivial. If the lists are stored on magnetic
tape, it may be necessary to read and write all the lists
in order to change the value of a single bit.
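
The operations just described might be sketched as follows. This is a minimal in-memory Python illustration, not the report's implementation: the class name `InvertedFile`, its method names, and the sample descriptors are assumptions, and each bit list is held in a Python integer, so the cost differences between memory media are not modelled.

```python
class InvertedFile:
    """One bit list per descriptor; bit j of a list refers to item j."""

    def __init__(self, num_items):
        self.num_items = num_items
        self.lists = {}                      # descriptor -> int used as a bit list

    def add_descriptor(self, name):
        self.lists[name] = 0                 # a new, all-zero list

    def add_item(self):
        self.num_items += 1                  # every list gains one bit position
        return self.num_items - 1

    def circle(self, name, item):
        self.lists[name] |= 1 << item        # set the bit to 1

    def uncircle(self, name, item):
        self.lists[name] &= ~(1 << item)     # set the bit to 0

    def query(self, names):
        response = (1 << self.num_items) - 1
        for name in names:
            response &= self.lists[name]     # intersect the selected lists
        return [j for j in range(self.num_items) if (response >> j) & 1]

f = InvertedFile(num_items=4)
for name in ("red", "round", "heavy"):
    f.add_descriptor(name)
for name, item in (("red", 0), ("red", 2), ("round", 2), ("round", 3)):
    f.circle(name, item)
hits = f.query(["red", "round"])
print(hits)
```

Only the lists named in the query are read; the list for "heavy" is never touched, which is the volumetric saving of the inverted organization.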

Such a computer-implemented inverted file is of course
similar  to  a  feature  card  system  employing  direct  coding. 
Indexing  information  is  grouped  by  descriptor  and  not  by 
item.  Position  within  a  group  (i.e.,  card-position  on  a 
feature  card  or  bit  position  in  a  list)  identifies  the 
item  to  which  a  descriptor  applies.  Consequently,  most 
of  our  conclusions  about  the  comparative  ease  or  difficulty 
of  various  operations  in  feature  card  systems  are  valid  for 
computer-implemented  inverted  file  systems  as  well.  However, 
the  relative  costs  may  be  quite  different  and  will  vary 
enormously  according  to  the  details  of  a  computer 
implementation of the inverted organization. An interesting
difference between the two systems appears in the addition
of items. With feature cards items can be added to the
system relatively easily until the number of items equals
the number of positions on a card. As we have seen, this
leads  to  a  step  function  for  the  evaluation  of  retrieval 
cost  relative  to  the  size  of  the  data  base.  This  function 
will  apparently  be  linear  in  a  computer-implemented 
inverted  file  system,  but  limitations  on  the  amount  which 
can  be  read  in  a  single  transmission  can  reintroduce  a 
step function.


XV. Computer-Implemented Item-Sequenced
File Organization

A straightforward computer implementation of an
item-sequenced file would consist of a set of item keys, one
for each item in the system. Each item key consists of
a set of bits, numbered sequentially; bit position j in
any given item key refers to descriptor $d_j$. Hence each
item key is D bits long (where D = the number of
descriptors in the system). Each bit has a value of
either 0 or 1. A given bit position in a given item
key  is  a  1  if  and  only  if  the  descriptor  associated  with 
that  bit  position  applies  to  the  item  associated  with  that 
item  key.  The  item  keys,  then,  correspond  to  the 
horizontal  lines  in  our  grid  model;  bit  positions  in  the 
item  keys  correspond  to  vertical  lines;  and  bits  with 
value  1  correspond  to  circled  intersections. 

A  retrieval  request  is  made  by  selecting  a  subset  of  the 
bit  positions  and  passing  through  the  entire  file,  testing 
each  item  key  for  bit  inclusion.  Specifically,  a  query 
key word is constructed: it consists also of D bits
numbered sequentially; a given bit position j is a 1
if and only if descriptor $d_j$ is in the query. The
query key word is then used to evaluate each item key in
the file. If a given item key has a 1 at every bit
position at which the query key has a 1, then the
corresponding item satisfies the query. Addition of
an item to the system is accomplished by creating a new
item key. The addition of a descriptor requires
lengthening every item key by one bit. Circling and
uncircling of intersections is accomplished by setting
bits to 1 or 0. As with inverted file systems, the
cost of performing these operations will vary enormously
depending on characteristics of the memory medium.
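
The bit inclusion test can be sketched in a few lines. The Python below is an illustration only: the names `make_key`, `satisfies`, and `retrieve` and the sample keys are hypothetical, and each key is held in a Python integer rather than in a fixed-length word.

```python
def make_key(descriptor_positions):
    """Build a key with a 1 at each given descriptor position."""
    key = 0
    for j in descriptor_positions:
        key |= 1 << j
    return key

def satisfies(item_key, query_key):
    """Bit inclusion test: every 1 in the query key is also 1 in the item key."""
    return (item_key & query_key) == query_key

def retrieve(item_keys, query_key):
    """Pass through the entire file, testing each item key."""
    return [i for i, key in enumerate(item_keys) if satisfies(key, query_key)]

item_keys = [make_key([0, 3]), make_key([1]), make_key([0, 1, 3]), make_key([3])]
hits = retrieve(item_keys, make_key([0, 3]))
print(hits)
```

Every item key is examined on every query, whatever the query contains, which is why retrieval in this organization transacts with the whole volume.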

Such  a  computer-implemented  item-sequenced  file  is,  of 
course,  similar  to  an  edge-notched  card  system  employing 
direct  coding.  Indexing  information  is  grouped  by  item 
and not by descriptor. Position within a group
(hole-position on an edge-notched card or bit position in an
item key) identifies a descriptor which applies to an
item. Consequently, most of our conclusions about the
comparative ease or difficulty of various operations in
edge-notched systems are valid for computer-implemented
item-sequenced file systems as well, although relative
costs may be quite different and will vary enormously,
according to details of a computer implementation of the
item-sequenced organization. Again, an interesting
difference appears in the addition of items. In
edge-notched systems adding an item has no effect on retrieval
cost until the number of items exceeds the capacity of
a sorting needle, which leads to a step function for the
evaluation of retrieval cost relative to the size of
the data base. This function will apparently be linear
in a computer-implemented item-sequenced file system,
but limitations on the amount which can be read in a
single transmission can reintroduce a step function.


XVI. The Use of Indirect and Superimposed
Coding in Computer Implementations

Having  maintained  the  analogy  between  our  grid  model  and 
the computer-implemented file organizations, we can simply
state  without  further  elaboration  that  both  indirect  and 
superimposed  coding  techniques  are  applicable  to  computer 
implementations  of  both  the  inverted  file  organization  and 
the  item-sequenced  file  organization.  Clearly,  the  same 
statistical  conditions  required  for  the  use  of  these 
techniques  in  edge-notched  and  feature  card  systems  will 
have  to  hold  in  order  to  justify  their  use  in  computer- 
implemented  systems.  If  these  conditions  hold,  then 
fewer  lists  will  be  required  in  the  inverted  file 
organization and fewer bits per item key in the
item-sequenced organization.


XVII. PDQ (Program for Descriptor Query)

PDQ is a program implemented on IBM System/360 equipment
(with  disk  packs  as  the  secondary  storage  medium)  which 
provides  an  information  retrieval  capability  to  the 
system  user.  Cross-indexing  is  provided  by  an  item- 
organized  file  of  key  words.  The  design  principles  for 
the  coding  scheme  are  virtually  identical  to  the  design 
principles in edge-notched card systems, and, in fact,
PDQ is virtually a computerized version of Anatol Holt's
Information  System  Theory  Project  edge-notched  card 
design,  with  some  added  capability  and  adaptive  features 
provided  by  the  computer. 

The  key  words  are  a  composite  of  direct,  indirect,  and 
superimposed codes. There is one key word per item. A
search is performed by constructing a query key word
and performing a bit inclusion test of the query key
against every item key. Query and update cost functions
are analogous to those for an edge-notched card system,
the difference being that in edge-notched card systems
the cost of retrieval, relative to the number of items, is
a step function, whereas in PDQ it is linear.

As  we  have  seen  already  from  our  grid  model  and  numerous 
examples, the coding technique is not determined by the
file organization. The same coding technique could be
employed  in  a  PDQ  which  used  an  inverted  file  organization. 
In  both  systems,  the  volume  of  cross-indexing  information 
is  identical.  The  only  difference  is  in  how  the  bits  are 
grouped  together  from  the  point  of  view  of  accessing. 
Referring  back  to  our  abstract  grid  model,  it  is  a 
question  of  whether  the  intersections  are  stored  row-wise 
(item-oriented) or column-wise (descriptor-oriented).

Even  if  they  are  pseudo-random  access,  most  computer 
memories and secondary storage devices require a
one-dimensional addressing scheme; the cost of accessing
material  along  another  axis  is  much  greater.  Hence  we 
must  choose  either  a  column  or  a  row  grouping.  The 
volumetric  formulae  presented  in  Section  XI  provide 
criteria for making this decision. The calculations
below are based on PDQ figures and compare the two
organizations as a function of the ratio of retrievals to
updates. Let p be the fraction of transactions which are
retrievals, so that 1 - p is the fraction which are updates.
The average volume transacted with per transaction is then

(a)  $V_D = \frac{V}{D}\left(p(\bar{d}_r - \bar{d}_I) + \bar{d}_I\right)$   [from Section XI, formula (f)]

in a descriptor-organized file, and

(b)  $V_I = V\left(p(1 - 1/I) + 1/I\right)$   [from Section XI, formula (g)]

in an item-organized file. When $V_D = V_I$ the volumetric
efficiencies are identical.

Assuming the coding technique is effective, the density
of 1 bits will be 1/2 and $\bar{d}_I = D/2$. Substituting
and solving for p yields

(c)  $p = \frac{D(I - 2)}{3DI - 2I\bar{d}_r - 2D}$

since $0 < \bar{d}_r \le \bar{d}_I = D/2$

(d)  $\frac{I - 2}{3I - 2} < p \le \frac{I - 2}{2I - 2}$

and since $I \gg 2$

(e)  $\frac{1}{3} \lesssim p \lesssim \frac{1}{2}$

which shows that if update transactions comprise more than
2/3 of the transactions, the item-sequenced organization is
superior; if retrieval transactions comprise more than 1/2
of the transactions, the inverted organization is superior.
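
The break-even point can be checked numerically. The Python sketch below reads p as the fraction of transactions that are retrievals and takes the average number of descriptors per item to be D/2, as in the derivation above; the function names and the sample values of I, D and the descriptors per request are hypothetical.

```python
import math

def break_even_p(I, D, d_r):
    """Retrieval fraction p at which V_D = V_I, assuming d_I = D / 2."""
    return D * (I - 2) / (3 * D * I - 2 * I * d_r - 2 * D)

def avg_volumes(p, I, D, d_r):
    """Average bits per transaction for the two organizations (d_I = D / 2)."""
    V, d_I = I * D, D / 2
    v_desc = V / D * (p * (d_r - d_I) + d_I)   # descriptor-organized
    v_item = V * (p * (1 - 1 / I) + 1 / I)     # item-organized
    return v_desc, v_item

I, D, d_r = 10_000, 200, 3          # hypothetical figures
p_star = break_even_p(I, D, d_r)
v_desc, v_item = avg_volumes(p_star, I, D, d_r)
print(round(p_star, 4), math.isclose(v_desc, v_item))
```

For retrieval fractions above `p_star` the descriptor-organized file handles the smaller average volume, and below it the item-organized file does.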


These  calculations  are  meaningful  but  do  not  tell  the 
whole  story,  since  volumetric  considerations  are  not  the 
only  factor.  In  particular,  the  following  major  consideration 
has  been  overlooked: 


Choosing  a  feature  card  is  like  choosing  a  hole  position. 
Analogous operations exist in PDQ and the hypothetical
inverted PDQ. Both involve some dictionary which converts
a descriptor into a bit pattern for inclusion testing or
into the address of an inverted list. The difference lies
in the fact that the cost of actually obtaining the
appropriate inverted lists involves more than simply
bit transmission (i.e., of the bit volume already
discussed). We must locate the lists in a pseudo-random
access  memory.  Because  the  set  of  lists  for  each 
retrieval  request  is  different,  there  can  be  no  way  of 
avoiding  an  additional  cost  factor:  seek  time  for  each 
inverted  list.  As  the  number  of  items  in  the  system 
increases,  the  number  of  inverted  lists  involved  in  a 
query  or  an  update  clearly  does  not  increase  at  the  same 
rate  as  does  the  volume  of  cross-indexing  bits.  Therefore, 
as  the  number  of  items  in  the  data  base  becomes  large, 
the  volumetric  considerations  will  dominate.  PDQ  is 
oriented  toward  a  small  data  base  with  frequent  updates. 

PDQ  would  be  more  suitable  for  a  larger  data  base  with 
infrequent  updates. 
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XVIII.  Batching  or  Buffering 

Thus far we have restricted our attention to retrieval
situations which do not permit batching. By batching
we mean, for instance, the accumulation of a number of
queries (or updates) which are then all processed
together, or the concurrent processing of a number of
queries. From one perspective, there is no need to
distinguish between batching and buffering. Later we
will clarify this statement and give formal content to
these terms with the help of Petri net models. In the
meantime, we will use them interchangeably.

XIX. Batching Queries and Updates


Some item-organized files can overcome their volumetric
inefficiency at retrieval time by batching retrieval
requests. The total volume of cross-indexing information
must still be examined for each request, but the
transmission of the volume need occur only once for the set
of batched requests. Edge-notched card systems have only
a very limited batching capability because of the hardware
characteristics of the medium. Computerized item-sequenced
retrieval systems, on the other hand, have a considerably
greater batching capability, determined by three principal
factors: high speed memory capacity, high speed memory
access time, and secondary storage to high speed storage
transmission time. The discrepancy between input/output
transmission speed and internal processing speed (the
second being usually much greater than the first per
query per item) creates unused processing capability
which can be exploited by batching.
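
The effect can be sketched in Python: the file is transmitted once, and every batched query is tested against each entry while it is in high speed memory. The function name `batched_scan`, the use of Python sets for descriptor lists, and the sample data are illustrative assumptions.

```python
def batched_scan(entries, queries):
    """Answer every query in `queries` during a single pass over `entries`.

    entries: iterable of (item_number, set_of_descriptors), read serially.
    queries: list of descriptor sets; an item satisfies a query when its
             descriptor set includes every descriptor of the query.
    Returns one hit list per query, in the order the queries were given.
    """
    hits = [[] for _ in queries]
    for item_number, descriptors in entries:        # one transmission of the file
        for q_index, query in enumerate(queries):   # in-core work per batched query
            if query <= descriptors:                # inclusion test
                hits[q_index].append(item_number)
    return hits


if __name__ == "__main__":
    file_entries = [
        (1, {"radar", "antenna"}),
        (2, {"radar", "filter", "noise"}),
        (3, {"antenna", "noise"}),
    ]
    batch = [{"radar"}, {"antenna", "noise"}, {"filter", "radar"}]
    print(batched_scan(file_entries, batch))
```

The outer loop corresponds to input/output transmission and the inner loop to internal processing, so the size of a useful batch is limited by how much inner-loop work fits into the time needed to transmit one entry.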


We can view this phenomenon from the other side, by
considering the subvolume of cross-indexing bits transacted
with in an inverted file when requests are batched. The
more requests we handle concurrently, the more inverted
lists we will have to fetch. In the limit the subvolume
becomes equal to the total volume. Hence the batching of
requests reduces the volumetric differences between item-
and descriptor-organized files in retrieval. Similarly,
the batching of updates reduces the volumetric differences
between item- and descriptor-organized files in update
processing. Even though relative volumes are only a
theoretical measure of the amount of information to be
transacted with, we can expect pragmatic verification for
this measure. In fact, computer-based item-organized
information retrieval systems generally batch requests
and computer-based descriptor-organized information
retrieval systems generally batch updates.

XX. An Important Asymmetry from
the User's Point of View


The  batching  of  requests  in  item-organized  systems  and 
the  batching  of  updates  in  descriptor-organized  systems 
have  similar  effects  in  terms  of  system  through-put 
capability  —  i.e.,  more  efficient  utilization  of  the 
hardware.  However,  they  differ  radically  from  the  point 
of  view  of  the  system  user.  This  is  best  seen  in  an 
interactive  context.  Under  such  circumstances,  the  user 
makes  a  request  and  expects  an  answer  to  his  request. 

He  may  not  be  willing  or  able  to  generate  a  batch  of 
requests  before  receiving  any  answers.  In  such  a  situation 
if  there  are  multiple  users  of  the  same  data  base,  the 
system  may  be  able  to  batch  requests  across  users;  but 
this  is  not  a  certainty,  and  in  many  situations  is  not 
possible.  On  the  other  hand,  when  the  user  performs  an 
update  he  does  not  expect  an  answer  back  in  the  same  sense. 
All he demands is that subsequent requests do not result
in incorrect answers. Hence, even though the user may not
be willing or able to batch updates, the system can, with
internal buffering, batch the update processing. This
asymmetry is of critical importance in system design, and
suggests that an interactive information retrieval system
with a large data-base should almost certainly be implemented
using the inverted file organization and employing internal
batching of updates.
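
A minimal Python sketch of such internal buffering follows. The class name `BufferedInvertedFile`, the `batch_size` threshold, and the choice to consult the buffer at query time are assumptions made for illustration; the report only requires that later requests never receive incorrect answers.

```python
from collections import defaultdict


class BufferedInvertedFile:
    """Inverted file whose updates are buffered and applied in batches."""

    def __init__(self, batch_size=100):
        self.lists = defaultdict(set)   # descriptor -> item numbers (stored lists)
        self.pending = []               # buffered (item_number, descriptors) updates
        self.batch_size = batch_size

    def update(self, item_number, descriptors):
        # The user expects no answer back, so the work is merely queued.
        self.pending.append((item_number, frozenset(descriptors)))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Group the buffered updates so each affected list is visited once.
        grouped = defaultdict(set)
        for item_number, descriptors in self.pending:
            for d in descriptors:
                grouped[d].add(item_number)
        for d, items in grouped.items():
            self.lists[d] |= items
        self.pending.clear()

    def query(self, descriptors):
        # Answered at once; the buffer is also searched so that updates
        # not yet applied to the lists are still reflected in the answer.
        descriptors = set(descriptors)
        if not descriptors:
            return []
        hits = set.intersection(*(self.lists.get(d, set()) for d in descriptors))
        hits |= {n for n, ds in self.pending if descriptors <= ds}
        return sorted(hits)
```

Queries are served individually while list maintenance is deferred, which is the asymmetry described above.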
XXI. An Alternative Method of Representing Lists
in Inverted File Organizations

The use of indirect and superimposed coding techniques
allows us to increase the density of circled intersections
in our grid model and to decrease the number of vertical
lines. Hence in computer-implemented inverted file
systems these techniques provide a method for increasing
the density of bits with value 1 and decreasing the
number of lists in the system. The computer also permits
an entirely different method of increasing the density of
1-bits, without decreasing the number of lists. This
is accomplished by abandoning the exact correlation
between bit position in the list and item number. Instead,
each list will consist of the item numbers themselves, for
exactly those items to which the descriptor applies. Since
different descriptors apply to different numbers of items,
the lists will vary in length. The process for
intersection of lists is now computationally more complex, and
in the sequel we will discuss some factors which affect
the cost of this process. If the computational speed is
much greater than the input/output transmission speed between
high speed storage and secondary storage, then the increase
in bit density and the consequent reduction in total bit
volume to be negotiated with may easily outweigh the
additional computational complexity of the intersection
process.

Using the same notation as in Section XI, the space in
bits to represent the cross-indexing information is
given by

$V_1 = I \times D$

where I is the number of items in the system and D is
the number of descriptors (lists) in the system. If we
represent each list by the method suggested above, the
space in bits is given by

$V_2 = d_i \times I \times \log_2 I$

where $d_i$ is the average number of descriptors that
apply to an item and $\log_2 I$ is the number of bits
necessary to represent an item number. Hence

$V_1 / V_2 = D / (d_i \times \log_2 I)$

In other words, the alternative method uses less space when
the number of descriptors in the data base is greater than
the product of the average number of descriptors per item
and the number of bits to represent an item number. This
is true of a wide spectrum of information retrieval
situations. The fundamental statistical assumption that
makes indirect or superimposed coding useful (i.e., the
notion that, while there may be many thousands of
different descriptors in a data base, only a very small
number of descriptors will apply to any single item in
the data base) also guarantees that D is greater than
$d_i \times \log_2 I$.

XXII. An Analogous Alternative for Item-
Sequenced File Organizations

There is, of course, an analogous technique for increasing
the density of 1-bits in item key words. The exact
correlation between bit position and descriptor is
abandoned. Instead, each item key consists of the
descriptor numbers themselves, for exactly those
descriptors which apply to the item represented by the item
key. The item keys will now vary in length. The process
for testing inclusion of the query descriptors in the
item key is now computationally more complex, but once
again, if there is a discrepancy between computational
speed and transmission speed, the reduction in bit volume
to be negotiated with may outweigh the additional
computational complexity.

The space in bits required by this method is

$V_3 = d_i \times I \times \log_2 D$

Hence

$V_1 / V_3 = D / (d_i \times \log_2 D)$

This technique is likely to save space in all systems
that satisfy the fundamental statistical assumption
mentioned above. In fact, for a wide spectrum of systems
the number of descriptors is considerably smaller than the
number of items, and comparison of the formulae for $V_2$
and $V_3$ shows that in such cases more space will be
saved in the item-sequenced organization than in the
inverted file organization.
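
The three space formulae can be compared numerically. The Python sketch below is only an illustration: the function names and the sample figures (one million items, ten thousand descriptors, ten descriptors per item) are assumptions chosen to satisfy the statistical assumption stated above.

```python
from math import log2


def v1_bit_matrix(I, D):
    """Bits for the full item-by-descriptor bit matrix."""
    return I * D


def v2_item_number_lists(I, d_i):
    """Bits when each inverted list holds item numbers."""
    return d_i * I * log2(I)


def v3_descriptor_number_keys(I, D, d_i):
    """Bits when each item key holds descriptor numbers."""
    return d_i * I * log2(D)


if __name__ == "__main__":
    I, D, d_i = 1_000_000, 10_000, 10    # hypothetical data base
    v1 = v1_bit_matrix(I, D)
    v2 = v2_item_number_lists(I, d_i)
    v3 = v3_descriptor_number_keys(I, D, d_i)
    print(f"V1/V2 = {v1 / v2:.1f}   V1/V3 = {v1 / v3:.1f}")
```

Because D is smaller than I in this example, log2(D) is smaller than log2(I) and the item-sequenced encoding yields the larger saving.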
XXIII. A Comparison of Three Organizations
for Indexing

We  have  now  come  far  enough  in  our  study  to  examine  three 
commonly  used  file  structures  for  indexing  —  the  item- 
sequenced  file,  the  inverted  file,  and  the  list-structured 
file  —  with  realistic  assumptions  about  hardware 
characteristics,  details  of  representation,  and  usage 
statistics. 

In  an  item-sequenced  file  each  document  is  represented  by 
an  entry  which  consists  of  a  variable  number  of  descriptors. 
A  query  is  performed  by  reading  the  whole  file,  determining 
for  each  entry  whether  the  query  is  satisfied  or  not,  and, 
if  satisfied,  adding  the  document  number  to  a  list  of 
'hits'.  An  update  is  performed  by  adding  an  entry  to  the 
file. 

In an inverted file each descriptor is represented by a
list of document numbers (for those documents to which
the descriptor applies). A query is performed by reading
the appropriate list for each descriptor, intersecting
the lists, and ending up with a final list of 'hits'.
An update is performed by adding a document number to
the lists for those descriptors which apply to the
document.

In a list-structured file each document is represented
by an entry which consists of a document name plus a
variable number of descriptor-pointer pairs. The pointer
associated with a descriptor points to the next entry to
which that descriptor is applicable. We assume we have
identified which descriptor in the query applies to the
smallest number of documents. A query is performed by
reading in the first entry to which the identified
descriptor applies; determining for the entry whether the
whole query is satisfied or not; if satisfied, adding
the document number to a list of 'hits'; and in any case
reading in the next entry to which the identified
descriptor applies, as indicated by the pointer associated
with the descriptor in the current entry. This process
continues until a null pointer is encountered. An update
is performed by adding an entry to the file and linking
each descriptor-pointer in the entry to its appropriate
descriptor-pointer chain.
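
The three query procedures can be sketched side by side in Python. Everything below is an illustrative assumption rather than the report's implementation: descriptors are strings, queries are non-empty sets, new entries are linked at the front of each chain, and the list-structured file keeps a small table of chain heads and chain lengths so that the most specific descriptor can be identified.

```python
def query_item_sequenced(entries, query):
    """entries: list of (doc_number, descriptor_set); the whole file is read."""
    return [doc for doc, descs in entries if query <= descs]


def query_inverted(lists, query):
    """lists: dict descriptor -> set of doc numbers; one list read per descriptor."""
    result = None
    for d in query:
        postings = lists.get(d, set())
        result = set(postings) if result is None else result & postings
    return sorted(result or [])


class ListStructuredFile:
    """Entries hold descriptor-pointer pairs; a pointer names the next entry
    carrying the same descriptor (None plays the role of the null pointer)."""

    def __init__(self):
        self.entries = {}   # doc number -> {descriptor: next doc number or None}
        self.head = {}      # descriptor -> first entry on its chain
        self.length = {}    # descriptor -> chain length

    def update(self, doc, descriptors):
        # Link the new entry at the front of each descriptor chain.
        self.entries[doc] = {d: self.head.get(d) for d in descriptors}
        for d in descriptors:
            self.head[d] = doc
            self.length[d] = self.length.get(d, 0) + 1

    def query(self, query):
        if not query or any(d not in self.head for d in query):
            return []
        rarest = min(query, key=lambda d: self.length[d])
        hits, doc = [], self.head[rarest]
        while doc is not None:                   # follow chain to the null pointer
            pairs = self.entries[doc]
            if all(d in pairs for d in query):   # test the whole query on this entry
                hits.append(doc)
            doc = pairs[rarest]
        return sorted(hits)
```

All three return the same hits for the same data; they differ in how much of the file must be read and how many separate seeks are implied, which is what the formulae that follow quantify.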

Basic Formulae for the Three Design Types: Average Volume
Dealt With in a Transaction

(1) Item-Sequenced File:

$V_{rI} = I\, d_i \log_2 D$

$V_{uI} = d_i \log_2 D$

$V_I = p V_{rI} + (1-p) V_{uI} = (I\, d_i \log_2 D)\,(p\,(1 - 1/I) + 1/I)$

(Compare this formula with Section XI, formula (g) and
Section XVII, formula (b).)

(2) Inverted File:

$V_{rD} = d_r \left[ \sum_{j=1}^{D} F(j) P(j) \right] \log_2 I$

$V_{uD} = d_i \left[ \sum_{j=1}^{D} F(j) P(j) \right] \log_2 I$

$V_D = p V_{rD} + (1-p) V_{uD}$

where F(j) is the number of items to which the j-th
descriptor applies and P(j) is the probability distribution
function of the space of all descriptor occurrences (i.e.,
if we observe the requests over a period of time and create
a string of all requests, then P(j) is the probability
that at any point on that string the j-th descriptor
occurs). (Compare this formula for $V_D$ with Section XI,
formula (f) and Section XVII, formula (a).)

If F(j) and P(j) are both uniform distributions, then:

$F(j) = d_i I / D$

$P(j) = 1/D$

and

$V_D = \dfrac{I\, d_i}{D} \log_2 I \,(p\,(d_r - d_i) + d_i)$

(3) List-Structured File:

$V_{uL} = (d_i + 1)\,(d_i (\log_2 D + \log_2 I) + \log_2 I)$

$V_L = p V_{rL} + (1-p) V_{uL}$

$\quad = (d_i (\log_2 D + \log_2 I) + \log_2 I) \left( p \sum_{j=1}^{D} F(j) P_{d_r}(j) + (1-p)(d_i + 1) \right)$

where $P_n(j)$ is the probability that, when the number of
descriptors in a request is n, the j-th descriptor will
(i) be in a request and (ii) be the lowest indexed descriptor
in the request. $P_n(j)$ is defined inductively as follows:

$P_1(j) = P(j)$


P 

n 


(j) 


P(j) 

d"  "" 

I 

S,=n 


+ 


k=n- 


(k) 


H-l 

P(Z)  +  Ip,  (k) 
k=n-2 


Note that the descriptors are indexed in descending order
of list length. Because of this $P_{n-1}(k) = 0$ for $1 \le k \le n-2$.

In an item-sequenced file, since the whole file is always
read when performing a retrieval, we can assume that the
only seek required (in pseudo-random access secondary
store, for instance) is the location of the first entry,
and that the file can be read serially and hence at
transmission speed. (This may require double buffering,
and/or overlapped computation and I/O, and/or a fast
enough computation rate to permit continuous I/O
transmission, and/or very short start-stop time on the I/O
gear, etc.) In other words, we assume that the time
required to process the query is simply a function of one
seek, serial transmission rate, and file length. We
assume the time required to process an update is
determined by one seek, serial transmission rate, and
entry length.

In an inverted file, the set of lists to be read will
in general vary from query to query. We assume that each
list will require a minimum of 1 seek, followed by serial
transmission of the entries on the list. The time
required to process the query will be a function of the
seek time, the number of descriptors in the query (i.e.,
the number of lists to be read), the serial transmission
rate, and the individual list length. We assume the time
required to process an update is determined by the number
of descriptors in an item, the seek time, serial
transmission rate, and individual list length.

In  a  list-structured  file  each  entry  read  will  require  a 
seek,  followed  by  serial  transmission  of  the  entry.  The 
time  required  to  perform  a  query  will  be  a  function  of 
the  seek  time,  the  number  of  items  for  the  most  specific 
descriptor  in  the  query,  the  serial  transmission  rate, 
and  the  individual  entry  length.  We  assume  the  time 
required  to  process  an  update  is  determined  by  the  size 
of  an  entry,  the  number  of  descriptor-pointer  pairs,  seek 
time,  and  serial  transmission  time. 

Average Transaction Time for the Three Design Types

$T_I = r_a + r_t\,(I\, d_i \log_2 D)\,(p\,(1 - 1/I) + 1/I)$

$T_D = r_a\,(p\,(d_r - d_i) + d_i) + r_t V_D$

$\quad = (p\,(d_r - d_i) + d_i) \left( r_a + r_t \left[ \sum_{j=1}^{D} F(j) P(j) \right] \log_2 I \right)$

$T_L = r_a \left( p \sum_{j=1}^{D} F(j) P_{d_r}(j) + (1-p)(d_i + 1) \right) + r_t V_L$

$\quad = \left( p \sum_{j=1}^{D} F(j) P_{d_r}(j) + (1-p)(d_i + 1) \right) \left( r_a + r_t\,(d_i (\log_2 D + \log_2 I) + \log_2 I) \right)$

where

$T_I$ is the average time of transaction in an item-
sequenced file

$T_D$ is the average time of transaction in a descriptor-
organized file

$T_L$ is the average time of transaction in a list-
structured file

$r_a$ is the average seek time

$r_t$ is the serial transmission rate
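
The three transaction-time formulae can be evaluated under the uniform assumption F(j) = d_i I / D. The Python sketch below is illustrative only. It treats r_t as the time to transmit one bit, and it assumes that P_{d_r}(j) sums to one over j, so that with constant F(j) the sum of F(j) P_{d_r}(j) reduces to d_i I / D; the function name and all sample figures are hypothetical.

```python
from math import log2


def transaction_times(I, D, d_i, d_r, p, r_a, r_t):
    """Return (T_I, T_D, T_L) assuming F(j) = d_i * I / D for every descriptor.

    r_a: average seek time; r_t: time to transmit one bit (same time unit).
    """
    avg_list = d_i * I / D                            # sum of F(j)P(j), F constant
    entry_bits = d_i * (log2(D) + log2(I)) + log2(I)  # one list-structured entry

    V_I = I * d_i * log2(D) * (p * (1 - 1 / I) + 1 / I)
    T_I = r_a + r_t * V_I                             # one seek, then serial read

    lists_read = p * (d_r - d_i) + d_i                # average lists per transaction
    T_D = lists_read * (r_a + r_t * avg_list * log2(I))

    entries_read = p * avg_list + (1 - p) * (d_i + 1) # average entries per transaction
    T_L = entries_read * (r_a + r_t * entry_bits)
    return T_I, T_D, T_L


if __name__ == "__main__":
    # Hypothetical device: 50 ms seek, 1 microsecond per bit.
    for p in (0.1, 0.5, 0.9):
        t_i, t_d, t_l = transaction_times(
            I=100_000, D=1_000, d_i=10, d_r=3, p=p, r_a=0.05, r_t=1e-6)
        print(f"p={p:.1f}  T_I={t_i:.3f}s  T_D={t_d:.3f}s  T_L={t_l:.3f}s")
```

Varying r_a against r_t in such a sketch shows how the seek term penalizes the list-structured file, which pays one seek per entry read.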
XXIV. A New Method for Performing List Intersections

Our  discussion  thus  far  has  concerned  itself  primarily 
with  questions  about  the  volume  of  information  to  be 
located and transacted with. There has been no
consideration of internal computation beyond the assumption
that computational speeds are always sufficient to permit
continuous I/O operation. In inverted file systems,
however, computation of the intersection of lists may
prove sufficiently expensive to invalidate this assumption.
E. Wong derives formulae for estimating the average number
of comparisons necessary to calculate the intersection of
two lists.^ We repeat his derivation here:


First, let A and B be unordered lists. Then, assuming
uniform distribution for the location of $a_i$ in list B,
it takes an average of $n_B/2$ comparisons to find $a_i$,
if $a_i$ is in B. If $a_i$ is not in B, it takes $n_B$
comparisons to ascertain this fact. The average number
of comparisons is then

$T_1 = (n_A - n_{AB})\, n_B + n_{AB}\, \dfrac{n_B}{2} = n_B \left( n_A - \dfrac{n_{AB}}{2} \right)$

where $n_A$ and $n_B$ are the numbers of elements in A and B
and $n_{AB}$ is the number of elements in the intersection
of A and B.

^E. Wong, Time Estimation in Boolean Index Searching
(December 1961, in High Speed Document Perusal, AD 285 255).

When A and B are ordered lists, a logarithmic search
procedure can be adopted (i.e., successively dividing B
into equi-probable subsets). Again assuming a uniform
distribution, the number of comparisons needed for each
$a_i$ is approximately $\log_2 n_B$ if $a_i \in B$ and
$\log_2 n_B + 1$ if $a_i \notin B$. The average number of
comparisons is then

$T_2 = n_{AB} \log_2 n_B + (n_A - n_{AB})(\log_2 n_B + 1)$

$\quad = n_A \log_2 n_B + (n_A - n_{AB})$
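
The two estimates can be set beside straightforward implementations of the two search methods. In this Python sketch the names `t1`, `t2`, `intersect_unordered` and `intersect_ordered` are illustrative; the linear search counts its own comparisons, while the ordered version relies on the standard library's `bisect_left` and is shown without a counter.

```python
from bisect import bisect_left
from math import log2


def t1(n_a, n_b, n_ab):
    """Estimated comparisons for unordered lists (linear search of B)."""
    return n_ab * n_b / 2 + (n_a - n_ab) * n_b


def t2(n_a, n_b, n_ab):
    """Estimated comparisons for ordered lists (logarithmic search of B)."""
    return n_ab * log2(n_b) + (n_a - n_ab) * (log2(n_b) + 1)


def intersect_unordered(a, b):
    """Linear search of b for each element of a; returns (hits, comparisons)."""
    hits, comparisons = [], 0
    for x in a:
        for y in b:
            comparisons += 1
            if x == y:
                hits.append(x)
                break
    return hits, comparisons


def intersect_ordered(a, b):
    """Binary search of the sorted list b for each element of a."""
    hits = []
    for x in a:
        i = bisect_left(b, x)
        if i < len(b) and b[i] == x:
            hits.append(x)
    return hits
```

For disjoint lists the counted comparisons equal `t1` exactly; when elements are found, the count differs from the estimate only through the n_B/2 approximation of the average search length.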

In order to calculate an intersection it is necessary to
determine of an element, a, of list A whether or not
a is a member of B. This is a familiar problem and
immediately suggests the use of hash (or scatter storage)
tables. We propose the following procedure:

Suppose $n_B \le n_A$. Create a hash table containing all
members of B. For each element, a, of A use the
hashing procedure to decide whether a is a member of B.
$A \cap B$ contains all elements of A for which this decision
is yes.

We observe that the average cost has three components.

$T_{31}$ = cost of creating hash table for B

$T_{32}$ = cost of deciding for elements of A which are
in B

$T_{33}$ = cost of deciding for elements of A which are
not in B

To obtain an explicit formula for this cost it is necessary
to choose a particular hash technique. For this application
the following technique is adequate:^

(1) Generate a hash address from an entry by squaring
the entry and choosing some bits from the center of
the square.

(2) Resolve collisions by random probing.
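
The procedure and the hash technique above can be sketched as follows. This Python illustration makes several assumptions of its own: entries are non-negative integers, the table size is a power of two, and "random probing" is imitated by a fixed pseudo-random sequence of offsets generated by the recurrence offset = (5 * offset + 1) mod N, which visits every offset once when N is a power of two. The class and function names are hypothetical.

```python
class ScatterTable:
    """Open-addressing table: mid-square hash address, pseudo-random probing."""

    def __init__(self, address_bits):
        self.bits = address_bits
        self.size = 1 << address_bits       # table size N, a power of two >= 4
        self.slots = [None] * self.size
        self.probes = 0                     # running probe count, for costing

    def _home(self, key):
        square = key * key
        shift = max((square.bit_length() - self.bits) // 2, 0)
        return (square >> shift) & (self.size - 1)  # bits from centre of square

    def _probe_addresses(self, key):
        home, offset = self._home(key), 0
        for _ in range(self.size):
            yield (home + offset) % self.size
            offset = (5 * offset + 1) % self.size   # next pseudo-random offset

    def insert(self, key):
        for address in self._probe_addresses(key):
            self.probes += 1
            if self.slots[address] is None or self.slots[address] == key:
                self.slots[address] = key
                return
        raise OverflowError("table full")

    def __contains__(self, key):
        for address in self._probe_addresses(key):
            self.probes += 1
            if self.slots[address] is None:
                return False
            if self.slots[address] == key:
                return True
        return False


def intersect(a, b):
    """Elements of the longer list that also occur in the shorter one."""
    if len(b) > len(a):
        a, b = b, a                          # hash the shorter list
    bits = max(2, (len(b) * 4 // 3).bit_length())   # keeps load factor below 3/4
    table = ScatterTable(bits)
    for key in b:
        table.insert(key)
    return [key for key in a if key in table]
```

The `probes` counter makes it possible to compare observed probe counts with the cost components $T_{31}$, $T_{32}$ and $T_{33}$.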

We can now exhibit explicit formulae of cost:

$T_{31} = n_B\,(c_h + (E - 1)\, c_r)$

where

$c_h$ = cost of generating a hash address

$c_r$ = cost of random probe

$E$ = average number of probes necessary to hash
an entry of B

$\quad = -\dfrac{1}{\alpha} \log(1 - \alpha)$

$\alpha$ = load factor of the hash table

$N$ = size of hash table

(See Morris for the derivation of E and of A below.)

If an element of A is also in B then the cost of the
decision is the same as the average cost of hashing an
element of B. Hence

$T_{32} = n_{AB}\,(c_h + (E - 1)\, c_r)$

If an element of A is not in B then the cost of the
decision is the same as the cost of adding a new element
to the hash table containing all of B. Hence

$T_{33} = (n_A - n_{AB})\,(c_h + (A - 1)\, c_r)$

where $A = \dfrac{1}{1 - \alpha}$

Now

$T_3 = T_{31} + T_{32} + T_{33}$

$\quad = n_B\,(c_h + (E-1)\, c_r) + n_{AB}\,(c_h + (E-1)\, c_r) + (n_A - n_{AB})\,(c_h + (A-1)\, c_r)$

$\quad = (n_A + n_B)\, c_h + c_r \left( (E-1)(n_B + n_{AB}) + (A-1)(n_A - n_{AB}) \right)$

For example, suppose we allocate space for a hash table
one third larger than list B. Then

$N = \tfrac{4}{3} n_B$ (so that $\alpha = \tfrac{3}{4}$)

$E = 1.83$

$A = 4$

Let us assume that

$c_h = 4$ comparisons

$c_r = 2$ comparisons

Then

$T_3 = 10\, n_A + 5.66\, n_B - 4.34\, n_{AB}$ comparisons

^See Morris, R., "Scatter Storage Techniques",
Communications of the ACM, January 1968.
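
The worked example can be reproduced with a few lines of Python. The function name `t3` is an illustrative assumption; the default arguments are the values quoted in the example (c_h = 4, c_r = 2, E = 1.83, A = 4).

```python
def t3(n_a, n_b, n_ab, c_h=4, c_r=2, E=1.83, A=4):
    """Estimated cost, in comparisons, of intersecting by hashing."""
    t31 = n_b * (c_h + (E - 1) * c_r)            # build the table for B
    t32 = n_ab * (c_h + (E - 1) * c_r)           # elements of A found in B
    t33 = (n_a - n_ab) * (c_h + (A - 1) * c_r)   # elements of A not in B
    return t31 + t32 + t33


if __name__ == "__main__":
    # Coefficients of n_A, n_B and n_AB in the closed form.
    coeff_a = t3(1, 0, 0)
    coeff_b = t3(0, 1, 0)
    coeff_ab = t3(1, 1, 1) - coeff_a - coeff_b
    print(round(coeff_a, 2), round(coeff_b, 2), round(coeff_ab, 2))
```

Other table sizes can be explored by passing different values of E and A.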

Let us now compare $T_3$ with $T_1$ and $T_2$. For purposes of
comparison let $n_A = n_B$.

           n_AB = 0                 n_AB = n_A/2               n_AB = n_A

    T_1    n_A x n_A                (3/4) n_A x n_A            (1/2) n_A x n_A

    T_2    n_A x (log2(n_A) + 1)    n_A x (log2(n_A) + 1/2)    n_A x log2(n_A)

    T_3    15.66 x n_A              13.49 x n_A                11.32 x n_A

From this we can observe that if $n_A > 23$, then $T_1 > T_3$
irrespective of the number of elements in the intersection.
$T_1 / T_3 = n_A / 22.64$ under the most favorable circumstances
for $T_1$ in the comparison. As $n_A$ increases, $T_1$ becomes
proportionately larger than $T_3$. Similarly, if
$n_A > 2^{15}$, then $T_2 > T_3$ irrespective of the number of
elements in the intersection.

For a wide spectrum of information retrieval systems,
list length will be less than $2^{15}$ for most lists;
hence if maintaining the inverted lists in sorted order
incurred no additional cost, method 2 would be preferable.

In systems which permit deletes, it may be extremely
expensive to maintain sorted order, and it would be
necessary to judge this expense against the possibility of
using method 3 (which improves relative to method 2 as the
data base grows). Method 3 is almost always preferable to
method 1, and system design evaluations based on method 1
are grossly unfair to inverted files.^ Note that it will
not do to sort the lists immediately prior to intersection.
This added cost would make $T_2$ worse than $T_3$ for
virtually all cases.

^M. Kochen, Preliminary Operational Analysis of
a Computer-Based, On-Demand Document Retrieval System

XXV. Data Compression: Another Encodement
for Inverted Lists


In Section XXI we discussed a method for increasing the
density of 1-bits in each inverted list which abandoned
the exact correlation between bit position and item
number. Each list consisted of the item numbers themselves,
for exactly those items to which the descriptor represented
by the list applied. The volumetric effects of this
technique were discussed in Sections XXI and XXIII. It
is possible to compress lists even further, by taking
advantage of 'burst' characteristics within a list.^

In our previous calculations we have assumed that the
number of bits needed to represent an item in an inverted
list is $\log_2 I$, where I is the total number of items
in the system. Instead, we can use the following encoding
procedure for an inverted list:

(1) Record the first item number as a ($\log_2 I$)-bit
quantity.

(2) Take the difference between adjacent item numbers
in the list.

(3) If there is one or more consecutive differences
of 1 (i.e., two or more consecutive item numbers),
record a 6-bit code ($SC_1$), followed by the
number (+1) of consecutive numbers (less than 61).

(4) If step 3 does not apply but the difference is
less than 61, record the difference as a 6-bit
quantity.

(5) If the difference is greater than 61 but less than
or equal to 4095, record a 6-bit code ($SC_2$)
followed by the difference as a 12-bit quantity.

(6) If the difference is greater than 4095, record a
6-bit code ($SC_3$) followed by the difference as
a ($\log_2 I$)-bit quantity.

The authors of Computer Programming Techniques for
Intelligence Analyst Application show a slightly more
than twofold reduction in their example as a result of
this technique.

^See Computer Programming Techniques for Intelligence
Analyst Application. AD 608 727, October 1964.
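
The six steps can be sketched in Python as an encoder and a matching decoder. Several details below are assumptions, since the procedure leaves them open: the special codes are given the values 61, 62 and 63; the count stored after $SC_1$ is taken to be the number of consecutive differences of 1 (at most 60); a difference of exactly 61 is routed to step 5; and `width` stands for $\log_2 I$. The output is a list of (bit count, value) fields rather than a packed bit string.

```python
SC1, SC2, SC3 = 61, 62, 63   # hypothetical values for the three special codes


def encode(items, width):
    """Encode a non-empty ascending list of distinct item numbers."""
    fields = [(width, items[0])]                      # step 1
    i = 1
    while i < len(items):
        diff = items[i] - items[i - 1]                # step 2
        if diff == 1:                                 # step 3: a burst
            run = 1
            while (i + run < len(items) and run < 60
                   and items[i + run] - items[i + run - 1] == 1):
                run += 1
            fields += [(6, SC1), (6, run)]
            i += run
        elif diff < 61:                               # step 4
            fields.append((6, diff))
            i += 1
        elif diff <= 4095:                            # step 5
            fields += [(6, SC2), (12, diff)]
            i += 1
        else:                                         # step 6
            fields += [(6, SC3), (width, diff)]
            i += 1
    return fields


def decode(fields):
    """Invert `encode`, returning the original list of item numbers."""
    it = iter(fields)
    _bits, current = next(it)
    items = [current]
    for _bits, value in it:
        if value == SC1:
            _bits, run = next(it)
            for _ in range(run):
                current += 1
                items.append(current)
        else:
            if value in (SC2, SC3):
                _bits, value = next(it)
            current += value
            items.append(current)
    return items


def encoded_bits(fields):
    return sum(bits for bits, _value in fields)
```

For the list 1000, 1001, 1002, 1003, 1010, 1500, 9000 with 20-bit item numbers, the encoding occupies 82 bits where the uncompressed list occupies 140.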

This technique could be applied selectively. In particular,
certain inverted lists will tend to grow large (some
authors suggest a Zipf distribution of inverted list
lengths). The compression would tend to be most
effective for such lists, in which the density of applicable
items is highest. Furthermore, for a wide spectrum of
information retrieval systems in which item numbers are
assigned sequentially on input, there is reason to expect
bursts to occur as a result of additions to the system
of groups of items related to the same subject or subjects.
In any case, by reserving one bit at the beginning of each
inverted list, the system can adaptively decide for each
list whether the compression technique should be employed.
The bit informs the system, at retrieval or update
time, how to interpret the list. This may involve
more computation time, and this must be weighed against
the reduction in the number of bits transmitted.

XXVI. A Grammar for Defining Graph
Representations of File Structures

In this section we present a formal apparatus for defining
models of file structures. We do not pretend to have
mathematical techniques which permit a formal analysis
of models defined in this way. At this point in time,
the formal apparatus serves only as a definitional vehicle;
questions of cost and efficiency of utilization under
varying usage statistics and hardware milieus must be
answered by performing a mathematical analysis based on
formulae such as those derived in earlier sections of
this report. Nevertheless, a formal apparatus for
defining models of file structures is of value, because
such definitions can provide significant insights and
facilitate the comparison of different structures meant
to accomplish the same task.

The state of a system is formalized as a finite undirected
graph with labeled nodes; every node has a label, called
the node type, and several nodes may have the same label.
Such a graph, when intended to represent a system state,
is called a configuration. The nodes are interpreted as
system parts; an arc is interpreted as a relation between
two parts. The labels correspond to classes of parts
which have identical possible contexts. Thus the nodes
corresponding to the cells of a memory might have identical
labels, for each cell has the same contextual possibilities:
each stands in relation to exactly one value and exactly
one of the cells stands in relation to a memory exchange
register. It will be observed that "part" means logical
part, not physical part; a part is anything which can
participate in an observable relation. Parts may be
things, values, states, etc.

The  appearance  of  an  arc  between  two  nodes  asserts  the 
possibility  of  conditioning  some  occurrence,  internal  or 
environmental,  upon  a  relation  between  the  two  corresponding 
parts. 

The formal apparatus of π-grammar is built around the
site-spec, which, like the configuration, is a finite
undirected graph with labeled nodes.

A π-grammar defines a class (perhaps infinite) of
configurations. Members of this class are said to satisfy
the grammar or be grammatical. The set of configurations
which satisfy a grammar corresponds to the class of
possible system states for the family of discrete
information systems described by the grammar.

A π-grammar sets forth the local laws which constrain
relation among parts; for example:

- Every row, column pair has exactly one
  value holder.

- Every bit position holds either zero or
  one, exclusively.

- Every integer has a unique successor or
  is marked with a "last" marker, exclusively.

- Every control counter holds exactly one
  address.

- Every binary tree element has at most two
  successors.

- Every value may be held by an arbitrary
  number of value holders.


Formally, a π-grammar is a finite list of grammar rules.
There are two types of grammar rules:

Type A rule: a site-spec which properly
contains a single circled
site-spec.

    [Figure: example of a type A rule; node labels N, R, F]

Type B rule: a site-spec containing no
circled site-spec.

    [Figure: example of a type B rule; node label N]

We call the circled site-spec in a type A rule the subject
of the rule. A rule of which s is a subject will be
referred to as an s-rule.


Site-Spec  Satisfaction 

We say that a site-spec S is satisfied in a configuration
C, or S has a satisfaction in C, if there exists a
1-1 map M from the nodes of S into the nodes of C
such that the type of every S-node is the same as that
of its image under M, and for every arc between nodes of
S there is an arc between the corresponding image points
in C. The map M is called a satisfaction map or
satisfaction of S in C. Any arbitrary collection of
nodes in a configuration is called a place, and the image
of S under M is in particular called a place of
satisfaction.
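
The definition of satisfaction can be sketched as a brute-force search for a 1-1 map that preserves node types and arcs. The representation used here (dictionaries of node types, lists of arcs) and the function name are illustrative assumptions, not part of the report's formalism, and the example configuration is invented.

```python
from itertools import permutations

def satisfactions(spec_types, spec_arcs, conf_types, conf_arcs):
    """Yield every 1-1 map M (as a dict) from site-spec nodes to
    configuration nodes that preserves node types and arcs."""
    spec_nodes = list(spec_types)
    # Arcs are undirected, so store each one as a frozenset of endpoints.
    conf_arc_set = {frozenset(a) for a in conf_arcs}
    # Each permutation of configuration nodes is one candidate 1-1 map.
    for image in permutations(conf_types, len(spec_nodes)):
        m = dict(zip(spec_nodes, image))
        if any(spec_types[s] != conf_types[m[s]] for s in spec_nodes):
            continue  # some node type differs from that of its image
        if all(frozenset((m[u], m[v])) in conf_arc_set for u, v in spec_arcs):
            yield m

# Hypothetical example: the site-spec "A -- B" in a small configuration.
spec_types = {"a": "A", "b": "B"}
spec_arcs = [("a", "b")]
conf_types = {1: "A", 2: "B", 3: "B", 4: "A"}
conf_arcs = [(1, 2), (1, 3)]

maps = list(satisfactions(spec_types, spec_arcs, conf_types, conf_arcs))
places = [set(m.values()) for m in maps]  # places of satisfaction
```

The search is exponential in the size of the site-spec, which is tolerable only because site-specs in grammar rules are small.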

A place p in a configuration obeys a rule r if (1) it
satisfies the subject of the rule, and (2) is contained
in a place p' which satisfies the rule, with the two
maps agreeing on the subject; p' is then referred to as
a place where p obeys r.

Grammaticality

A configuration is said to be grammatical if and only if:

(1)  If a place p satisfies a subject s in the grammar,
     then p obeys at least one s-rule in the grammar.

(2)  For every arc (p,q) in the configuration, there must
     exist a rule r satisfied in a place including p and
     q, such that the inverse images p' and q' under the
     satisfaction map are connected by an arc in r.

(3)  There are no satisfactions of type B rules in the
     configuration.

The  way  in  which  the  definition  works  might  best  be 
clarified  by  example. 

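[Rule diagram not recoverable from the scan.]
In the absence of other rules with the same subject, this
rule asserts that every A has at least one B.

[Rule diagram not recoverable from the scan.]
Every A has exactly one B.

[Rule diagram not recoverable from the scan.]
Every A has at least one B or at least one C or at least
one B and one C.

[Rule diagram not recoverable from the scan.]
Every A has exactly one B or exactly one C, but not both.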

A  Simple  File  Structure 


Consider  the  following  (informally  characterized)  file 
structure: 


1.  A file consists of named records and named properties.

2.  No two records (or properties) of the same file may
    have the same name.

3.  Each record has one data position for each property
    of the file.

4.  Each property specifies a domain of possible values.

5.  Each data position always contains some one value
    belonging to the value domain specified by the
    property corresponding to that data position.
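
The five conditions above can be sketched as a small consistency check. The dictionary layout, the function name, and the sample names and values are assumptions made for illustration; the report itself specifies the structure only through configurations and grammar rules.

```python
def check_file(properties, records):
    """Return a list of violations of conditions 3 through 5.

    properties: {property name: set of permitted values}
    records:    {record name: {property name: value}}
    Dictionary keys are unique, which gives condition 2 for free.
    """
    problems = []
    for rname, positions in records.items():
        # Condition 3: one data position per property of the file.
        if set(positions) != set(properties):
            problems.append(f"{rname}: data positions do not match properties")
            continue
        # Conditions 4 and 5: each value lies in its property's domain.
        for pname, value in positions.items():
            if value not in properties[pname]:
                problems.append(f"{rname}.{pname}: {value!r} outside domain")
    return problems

# Hypothetical instance: two records, two properties, one domain of
# four values and one domain of two values.
properties = {"P1": {0, 1, 2, 3}, "P2": {0, 1}}
records = {"R1": {"P1": 2, "P2": 0}, "R2": {"P1": 3, "P2": 1}}
violations = check_file(properties, records)
```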

C1* is a configuration which represents an instance of
such a file structure.


*  To facilitate the illustration of configurations a
convention has been adopted to reduce the number of arcs
in the drawing. According to this convention, two nodes
are connected (they "associate") if it is possible to
reach one directly from the other without turning sharp
corners. Hence in C1 all records are connected to the
file, but not to each other.

Mnemonic Aid
d  data position
F  file
N  name
P  property
r  record
V  value
X  value domain


In  this  instance  the  file  consists  of  two  records  and  two 
properties.  One  property  specifies  a  value  domain 
consisting  of  four  values;  the  other  specifies  a  value 
domain  of  two  values. 


The following grammar suffices for defining the class of
configurations informally characterized above. [The rule
diagrams are not recoverable from the scan; the statement
accompanying each rule is given.]

1:  A file has at least one record and property.

2:  A record belongs to exactly one file and has exactly
    one name.

3:  A property belongs to exactly one file, has exactly
    one name, and specifies exactly one value domain.

4:  A data position is coordinated by exactly one
    (record, property) pair of a file and contains exactly
    one value from the value domain specified by the
    property.

5:  A (record, property) pair of a file coordinates
    exactly one data position.

6:  A name associates with at least one record or at
    least one property or both. No two records
    (properties) within the same file may have the same
    name.

7:  A value associates with at least one value domain.

8:  A value domain associates with at least one value and
    at least one property.


It is interesting to consider briefly some interpretations
of the class of configurations defined by the above
grammar.

Suppose that a list of event types is defined such that
the only change possible in a configuration is the
reassociation of data positions with values. In other
words, given a starting configuration, the number of
records and properties in a file would remain constant,
the relations between properties, value domains, and
values would remain fixed, but the particular value
associated with each data position could change.

A  conventional  fixed  length,  fixed  format  table  behaves 
in  such  a  manner.  The  lines  in  the  table  correspond  to 
records.  The  fields  in  the  table  correspond  to 
properties  and  the  fields  specify  a  value  domain  by  virtue 
of  the  number  of  bits  allotted  per  field.  The  only 
variable  aspect  of  such  a  table  is  the  set  of  values 
contained  in  the  field  positions  on  each  line. 


An Interpretation of Configuration C1 as an instance of
a Fixed-Length Fixed-Format table:

[Table: rows Line 1 and Line 2, columns Gene and Tom; the
cell values are not recoverable from the scan.]

Gene is the name of a property which
specifies a 2-bit domain.

Tom is the name of a property which
specifies a 1-bit value domain.

Line 1 is the name of a record.

Line 2 is the name of a record.
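
A fixed-length, fixed-format table of this kind can be sketched by packing each line into one integer, with field widths taken from the interpretation above (Gene: 2 bits, Tom: 1 bit). The packing order, the function names, and the cell values are assumptions for illustration only.

```python
FIELDS = [("Gene", 2), ("Tom", 1)]  # (property name, field width in bits)

def pack(values):
    """Pack one line's field values into a single integer."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        # The field width is what specifies the value domain.
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        word = (word << width) | v
    return word

def unpack(word):
    """Recover the field values from a packed line."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

# Cell values below are invented for illustration.
table = {"Line 1": pack({"Gene": 3, "Tom": 0}),
         "Line 2": pack({"Gene": 1, "Tom": 1})}

# The only kind of change this interpretation permits: a data
# position is reassociated with another value from its domain.
table["Line 1"] = pack({"Gene": 2, "Tom": 1})
```

The number of lines and the field layout are fixed once the table is created; only the packed values vary.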

Suppose  now  that  the  list  of  event  types  is  extended  to 
permit  addition  and  deletion  of  records  to  the  file. 

A conventional, simply-formatted tape file behaves in
such a manner. Every record in the file has the same
format: each record consists of a set of values, one
per property as specified by the format. The only
variation from record to record is the particular value
set. However, the number of records in the file is
permitted to vary.

Suppose  the  list  of  event  types  is  further  extended  to 
permit  addition  and  deletion  of  properties  and  value 
domains.  A  system  with  dynamically  definable  variable 
length  tables  would  behave  in  such  a  manner.  A  data-base 
with  dynamic  restructuring  capabilities  might  also  be 
characterized  in  this  way. 

File  Hierarchies 


The grammar presented above provides a basis for
constructing more sophisticated systems. As an example, the
following grammar defines a class of configurations
suggestive of the data-base structures obtainable in ADAM.*


1:  [Diagram not recoverable.]  (From rule 1 of previous
    grammar)

2:  [Diagram not recoverable.]  (From rule 2)

3:  [Diagram not recoverable.]  (From rule 6)

C2 is a configuration from the class defined
by the above rules.

C2:  [Diagram not recoverable from the scan.]

Mnemonic Aid
F  file
N  name
r  record

We will call this class of configurations
a File Unit.

*  ADAM - A Generalized Data Management System. Paper
presented at SJCC 1966, by T.L. Conners, The Mitre
Corporation.

4:  [Diagram not recoverable.]
    A format has at least one property.


5:  [Diagram not recoverable.]  (From rule 3, previous
    grammar)

6:  [Diagram not recoverable.]  (From rule 6)

7:  [Diagram not recoverable.]  (From rule 7)

8:  [Diagram not recoverable.]  (From rule 8)

C3 is a configuration from the class defined
by rules 4 through 8:

C3:  [Diagram not recoverable from the scan.]

We will call this class of configurations
a Format Unit.


9:  [Diagram not recoverable.]
    In combination with rule 1, this guarantees
    that every file will have exactly one format.

10: [Diagram not recoverable.]  (From rule 4, previous
    grammar)

11: [Diagram not recoverable.]  (From rule 5)

Rules 9 through 11 may be thought of as
defining the way in which File Units and
Format Units connect. C4 is a configuration
from the class defined by rules 1 through 11.
Refer back to configuration C1 for comparison.


We now extend the grammar rules to permit
file hierarchies.

5a:  [Diagram not recoverable.]
     Previously, a property specified exactly
     one value domain. The addition of the rules
     in 5a permits a property to specify either
     exactly one value domain or exactly one
     format domain (Y).

10a: [Diagram not recoverable.]
     Previously, a data position contained
     exactly one value from the value domain
     specified by the associated property. The
     addition of the rules in 10a permits an
     alternative: a data position may specify
     exactly one file; the format of that file
     relates to the property associated with the
     data position, via the unique intermediary
     format domain Y.

12:  [Diagram not recoverable.]
     A format domain associates with a unique format.

Rules 5a and 12 may be thought of as defining the way in
which Format Units connect to each other. Rule 10a
extends the way of connecting File Units to Format Units
and allows files to have sub-files.


Rules  1  through  12  define  a  class  of  configurations  of 
which  C4  is  surely  a  trivial  instance.  A  more  interesting 
example  is  configuration  C5. 


Mnemonic Aid
d  data position
F  file
N  name
P  property
r  record
V  value
X  value domain
Y  format domain
Φ  format

C5:  [Diagram not recoverable from the scan.]

Sub-files as represented here are analogous to ADAM
repeating groups.*

In configuration C5 we have represented a file with two
records. The format of that file requires that each
record of the file have two data holders that contain
unique values and one data holder that specifies a
unique sub-file. The values must be chosen from the
appropriate value domain, as specified by the properties
of the format. All sub-files must have the same
sub-format as specified by the single P -- Y in the format.
Hence, the two sub-files exhibited are forced to associate
with the appropriate sub-format, and this in turn guarantees
that their records (one sub-file has a single record, the
other has two records) all have two data holders containing
unique values from the appropriate domains.

Once the Format Units have been chosen, the grammar allows

(1)  the number of records in any file to be variable, and

(2)  the free choice of which value (in the appropriate
     domain) a data holder associates with; but the grammar
     fixes all other aspects of the structure.

*  Conners, op. cit.


XXVII.  A Critique of Balanced Trees


In preceding sections we have discussed cross-indexing
in considerable detail, but without ever directly
considering the problem of converting a descriptor into
a code or list address. All of the computer-based
indexing systems which we have dealt with require some
sort of dictionary to perform this conversion. In this
section we will examine the use of balanced trees -- the
technique employed to accomplish this conversion in
MULTI-LIST. As is often the case with information
retrieval systems, the justification for this technique
is based on a number of questionable assumptions about
usage statistics and costs, and we shall want to challenge
these assumptions or at least make explicit their
implications. We shall then show that, even granting these
assumptions, there are other, more efficient techniques.
Finally, we will propose an alternative technique.


The following quote will give some idea of the relative
importance of Landauer's work in the MULTI-LIST system:

    The topic of this dissertation evolved from
    the research on new methods of computer memory
    organization that potentially lend themselves
    to efficient information storage and retrieval.
    An important place in this area is held by the
    MULTI-LIST organization of the memory, which
    is an extension of the list-type associative
    memory conceived originally by Newell, Shaw, and
    Simon for the simulation of human thought processes
    in learning and problem solving. The tree, which
    is a basic building block of the MULTI-LIST system,
    constitutes the central notion of this
    dissertation.*

*  Walter I. Landauer, The Tree as a Stratagem for
Automatic Information Handling, Ph.D. Thesis, University
of Pennsylvania, 1962. Page xii.

Let us, then, briefly outline the use of trees in
MULTI-LIST. MULTI-LIST grows a tree for the set of descriptors
in the system. All nodes of the tree will have some
fixed number m of branches. The tree is grown (i.e.,
descriptors are added) in such a way that the tree is
always balanced -- that is, each level is completely
filled before the next level is begun. Thus every path
through the tree visits either n or n-1 (the lowest
level may be incomplete) nodes, where n = the number of
levels in the tree. Each node consists of m (m = the
number of branches at each node) catenae. Each catena
consists of a key and a pointer to some node in the next
level; the pointers of emerging branches point to lists.
When the system must locate the list for some descriptor
(i.e., in processing a query), the tree is used as
follows: The descriptor is compared arithmetically with
the key of the first catena of the root node; if it is
less than or equal to the key, the pointer in that catena
is followed to a node in the next level. If the descriptor
is greater than the key in the first catena, it is
compared to the key in the second catena of the node; if
it is less than or equal to that key, the associated
pointer is followed; if not, the descriptor is compared
to the key in the next catena; and so forth. The tree is
constructed in such a way that the descriptor must be
less than or equal to at least the last key. The same
procedure is followed at the next node encountered, and
so forth. The pointer taken at the lowest level (i.e.,
upon emerging from the tree) will lead to the list for
the descriptor. Suppose, for example, that the possible
descriptor names are the integers, and that we have a
system which thus far contains the following descriptors:
4, 5, 13, 17, 19, 23, 32, 34, 37, 41, 42, 43, 44, 45,
52, 56, 59, 61, 62, 73, 75, 76, 77, 79, 83, 91, 98.

Figure XXVII-1 shows a tree for this example, with m
(i.e., the number of branches at a node) = 3, n (= the
number of levels) = 3, and F (= the number of emerging
branches) = 27.
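
The traversal just described can be sketched for the 27-descriptor example. The bottom-up construction, the tuple representation of a catena, and the "list-k" strings standing in for list addresses are assumptions of this sketch; the report does not give MULTI-LIST's actual growth procedure.

```python
DESCRIPTORS = [4, 5, 13, 17, 19, 23, 32, 34, 37, 41, 42, 43, 44, 45,
               52, 56, 59, 61, 62, 73, 75, 76, 77, 79, 83, 91, 98]

def build(keys, m):
    """Build a tree bottom-up from sorted keys (here F = m**n exactly).
    A node is a list of catenae (key, pointer); the key of a catena is
    the largest key reachable through its pointer."""
    level = [(k, f"list-{k}") for k in keys]      # emerging branches
    while len(level) > m:
        nodes = [level[i:i + m] for i in range(0, len(level), m)]
        level = [(node[-1][0], node) for node in nodes]
    return level                                   # the root node

def search(root, descriptor):
    """Serial comparison at each node; returns (pointer, comparisons).
    As in the text, the descriptor is assumed to be in the system,
    so no equality check is made."""
    node, comparisons = root, 0
    while True:
        for key, pointer in node:
            comparisons += 1
            if descriptor <= key:
                break
        else:
            return None, comparisons               # larger than every key
        if isinstance(pointer, str):               # emerged from the tree
            return pointer, comparisons
        node = pointer

root = build(DESCRIPTORS, 3)
average = sum(search(root, k)[1] for k in DESCRIPTORS) / len(DESCRIPTORS)
```

With every key compared serially, the cost of a traversal ranges from n comparisons (the smallest descriptor) to m times n (the largest).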


Landauer derives cost measures for such tree structures --
primarily in order to determine the optimum value for m.
He begins by asserting that retrieval is the only operation
which need be considered in evaluating system costs:

    The management of an information handling system
    involves three basic operations: filing, retrieval
    and deletion of an item. Whereas filing and
    deletion are operations that keep the file
    updated, and are therefore inherently "one shot"
    operations, retrieval will in all probability
    be a recurrent operation, i.e., a single item
    may be retrieved many times within its lifetime
    in the file. Consequently the preponderance
    of the retrieval operation with
    respect to speed and efficiency is obvious.
    Hence the conditions for optimum efficiency
    will be obtained from the task of information
    retrieval, specifically from a single search
    of the file.*

Landauer  then  defines  the  cost  of  a  search  as  the  product 
of  the  time  per  search  (which  in  turn  is  defined  solely 
by  the  number  of  comparisons  made  during  the  search)  and 
the  system  cost.  System  cost  is  assumed  proportional  to 
memory  size  alone. 


*  Landauer, op. cit., page 13.

The average search time T (i.e., the average number of
comparisons per tree-traversal) is given as:

$$T = n \sum_{i=1}^{m} \frac{i}{m} = \frac{\log_e F}{\log_e m} \cdot \frac{m+1}{2}$$

The cost of the tree in catena units is:

$$C = F + \frac{F}{m} + \frac{F}{m^2} + \cdots = \frac{m(F-1)}{m-1}$$

The product of C and T is the inverse of the efficiency:

$$I = C \times T = \frac{m(F-1)}{m-1} \cdot \frac{m+1}{2} \cdot \frac{\log_e F}{\log_e m}$$

"dI/dm yields a minimum at m = 5.25 approximately.
This value of m is the branching factor corresponding
to an optimum in retrieval efficiency."* On the other
hand, "the differentiation of the traversal-time, T,
with respect to m yields a minimum at m = 3.5
approximately. However, a plot of the corresponding
curve shows that the minimum is rather broad and m = 5.25
that was obtained above to minimize the product of cost
and time lies well within the range of the broad
traversal-time minimum."**
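
The two quoted minima can be checked numerically by treating m as a continuous variable. The sketch below assumes the reading T = (log F / log m)(m+1)/2 and C = m(F-1)/(m-1); the grid search, the function names, and the choice of F are illustrative assumptions.

```python
import math

def T(m, F):
    """Average comparisons per traversal, on the reading assumed above."""
    return (math.log(F) / math.log(m)) * (m + 1) / 2

def C(m, F):
    """Size of the tree in catenae."""
    return m * (F - 1) / (m - 1)

def argmin(f, lo, hi, step=0.001):
    """Crude grid search; adequate for a smooth one-dimensional curve."""
    n = int(round((hi - lo) / step))
    return min((lo + i * step for i in range(n + 1)), key=f)

# F is arbitrary: it only scales both curves, so the minima do not move.
F = 10**6
m_cost = argmin(lambda m: C(m, F) * T(m, F), 2.0, 10.0)   # minimizes I = C x T
m_time = argmin(lambda m: T(m, F), 2.0, 10.0)             # minimizes T alone
```

On this reading the product minimum falls close to the quoted 5.25, and the traversal-time minimum falls between 3.5 and 3.6, which is consistent with the "approximately 3.5" in the quotation.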

*  Ibid., page 15.

** Ibid., page 16.

We would like to raise three fundamental objections to
Landauer's arguments. Let us first deal with his assertion
that only retrieval is significant in evaluating system
efficiency. To begin with, the characterization of
filing and deletion as "inherently 'one-shot' operations"
cannot possibly be a universal property of all
information-handling systems. On-line command and control
or intelligence systems require continuous "real-time"
updating facilities, and the filing and deleting of
information which is never actually "retrieved" can be a
common phenomenon. Promoters of MULTI-LIST have in fact
argued for its use over other systems in such contexts
just because of the "relative ease of update" in
MULTI-LIST. By leaving out the cost
of filing and deletion in what follows, Landauer, in
effect, restricts the validity of his work to situations
in which (presumably after some initial phase of system
creation) there are no updates. (The reader is referred
to the formulae in Section XXVII, in which the role of
the retrieval/update ratio is explicitly expressed.)
Suppose we accept this assumption. If filing and deletion
are not cost factors, we may structure the information in
any way we choose, without having to account for the cost
of the structuring procedure. We might then, for example,
consider the following structure: a single sorted list
of associative catenae, ordered by key magnitude, just as
Landauer suggests, but searched by binary search technique
instead of serial comparison. In this case the space
required for associative catenae is clearly a minimum
(equal to F, the number of lists that emerge from the
tree). This arrangement can be viewed as identical to
Landauer's prescription, but with m set equal to F
so that we have a one-level tree. The number of
comparisons in a traversal using binary search technique will
of course be log2 F, which is a significant
improvement over Landauer's (F+1)/2. In fact -- if we
correct Landauer's formula for the number of comparisons
per search (see below) -- log2 F will correspond roughly
to the cost of a MULTI-LIST search through a tree with
m = 2 (i.e., a binary tree), which is a minimum for the
number of comparisons. Hence the single sorted list of
associative catenae, ordered by key magnitude and searched
by utilizing binary search technique, requires no more
space than the most space-conservative balanced tree and
no more comparison time than the most
comparison-conservative balanced tree. In short, if we accept
Landauer's assumptions and cost-criteria, it is superior to
the balanced tree scheme regardless of the value of m
-- whether it be 2, 5.25, 3.5, or any other number.
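
The single sorted list searched by binary search can be sketched with the 27 descriptors used earlier in this section. The "list-k" strings standing in for list addresses and the function names are illustrative assumptions.

```python
from bisect import bisect_left
import math

DESCRIPTORS = [4, 5, 13, 17, 19, 23, 32, 34, 37, 41, 42, 43, 44, 45,
               52, 56, 59, 61, 62, 73, 75, 76, 77, 79, 83, 91, 98]

# One catena (key, pointer) per emerging list: F catenae in all,
# already in order of key magnitude.
CATENAE = [(k, f"list-{k}") for k in DESCRIPTORS]
KEYS = [k for k, _ in CATENAE]

def decode(descriptor):
    """Return the list pointer for a descriptor, or None if absent."""
    i = bisect_left(KEYS, descriptor)
    if i < len(KEYS) and KEYS[i] == descriptor:
        return CATENAE[i][1]
    return None

def count_comparisons(descriptor):
    """Hand-rolled binary search that counts key comparisons."""
    lo, hi, comparisons = 0, len(KEYS), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if KEYS[mid] < descriptor:
            lo = mid + 1
        else:
            hi = mid
    return comparisons

worst = max(count_comparisons(k) for k in KEYS)
serial_average = (len(KEYS) + 1) / 2    # (F+1)/2 for serial comparison
```

For F = 27 the binary search never needs more than 5 comparisons (log2 27 rounded up), against an average of 14 for serial comparison of the same list.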

It  may  be  expensive  to  generate  or  maintain  a  fully  sorted 
list  of  this  size,  but  Landauer's  formulation  of  the 
problem  precludes  considering  generation  or  maintenance 
as  cost  factors.  Again,  there  may  be  difficulties 
associated  with  the  fact  that  a  single  sorted  list  cannot 
fit  into  main  memory,  but  Landauer  assumes  that  the 
memory  is  a  one-level,  truly  random  access  device.  If 
there  is  a  set  of  assumptions  about  usage  statistics  and 
hardware  which  justify  Landauer's  conclusions,  they  are 
certainly  not  the  assumptions  which  he  makes.  In  fact 
there  may  be  no  assumptions  which  lead  to  his  conclusions. 

Our second fundamental objection is to Landauer's definition
of the cost of a search as the product of the number of
comparisons and the "system cost", where the latter is
proportional to the amount of memory required. Clearly no
computer installation operating in batched sequential
mode would ever charge rates based on such a formulation.
One cannot in general assume that there will be a
reduction in real cost when less than the available
memory is used. Even a time-shared facility could not
charge rates on such a basis. For example, in any real
computing milieu, main memory is limited; if it is
overflowed, input/output delays may cause enormous
increases in time when secondary storage is utilized.
Landauer's formulation, of course, does not evaluate such
possibilities. In general, the relative cost of memory
space and computational time will vary enormously from
one computing milieu to another, so that Landauer's
product -- although mathematically very convenient --
will not be very useful as a measure of efficiency.

Before leaving MULTI-LIST we must raise one further
objection to Landauer's formulations. Based on the
assumption that the keys of a node have equal chances of
being selected, Landauer asserts that the average number
of comparisons at a node will be (m+1)/2. However, there
is no need to compare a descriptor against the m-th
catena of a node (except, possibly, at the lowest tree
level): if the first (m-1) comparisons fail, merely
follow the m-th pointer. In other words, the average
number of comparisons is (m+1)/2 - 1/m (except, possibly,
for lowest level nodes), and the computation time involved
must lie between (m+1)/2 and (m+1)/2 - 1/m. The
corrected probability estimate yields (for trees with
large F, at least) a comparison time minimum at m = 2,
i.e., a binary tree -- and not at m = 3.5. This
adjustment also has repercussions for the evaluation of the
minimum for Landauer's product-formulation of efficiency,
of course.
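
The corrected count can be verified by enumerating the m equally likely outcomes at a node. The function names and the choice of F are assumptions of this sketch; the per-traversal figure simply multiplies the per-node average by the number of levels, log F / log m.

```python
from fractions import Fraction
import math

def landauer_per_node(m):
    """Average comparisons when all m keys may be compared."""
    return sum(Fraction(i, m) for i in range(1, m + 1))

def corrected_per_node(m):
    """Average comparisons when the m-th key is never compared:
    outcomes 1..m-1 cost i comparisons, outcome m costs m-1."""
    costs = list(range(1, m)) + [m - 1]
    return sum(Fraction(c, m) for c in costs)

def traversal(per_node, m, F):
    """Comparisons per traversal of a tree with log_m F levels."""
    return float(per_node(m)) * math.log(F) / math.log(m)

F = 10**6   # an arbitrary large F
best = min(range(2, 20), key=lambda m: traversal(corrected_per_node, m, F))
```

For m = 3, for instance, the outcomes cost 1, 2 and 2 comparisons, giving 5/3 rather than 2; over integer branching factors the corrected traversal count is smallest for the binary tree.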


This critique would be incomplete if we did not suggest
some alternative to the balanced tree as a decoding
technique for descriptors. We have already suggested a
formulation involving binary search techniques applied to
a single sorted list, which, given Landauer's assumptions,
is superior to his solution. However, the critical factors
of generation and update, ignored by Landauer, are known
to be expensive functions when working with a single
sorted list. The following technique is superior both for
Landauer's rather narrow assumptions and for more general
assumptions.

Reference keys (i.e., descriptors) are decoded by using
hash techniques. Costs here depend upon calculation time
for the hash function, search time in the hash table, and
storage space required by the table. Extra space is
required in the table in order to maintain low search
times, but it has been shown that the use of a good
hash function (such as radix conversion) permits 80%
utilization of the hash table, with searches requiring on
the average less than two comparisons in the hash table.
Entries in the table will contain the key and a pointer
to the appropriate list. Hence the table will require
only slightly more space than the fully sorted list
(Landauer's m = F case). Time will be approximately
two comparisons + hash function evaluation. The larger
F is, the more favorably this technique compares with
binary search and balanced tree search. There is no need
for sorting. Hash table expansion and contraction are the
only updating functions that have a cost significantly
greater than decoding. (Addition or deletion of a single
key involves only slightly more cost than decoding a key.
Repeated addition or deletion, analogous to generation and
update, involve table expansion and contraction.) Table
expansion is handled by recognizing when the hash table is
80% full, then requesting that the hash function increase
the modulus it is using (i.e., increase hash table size),
rehashing the current entries in the table, and continuing
from there. Contraction is handled by the same procedure,
except that the modulus is decreased when the table is
sparse. The whole hash table need not fit in core: high
order bits of the evaluated hash function determine the
appropriate segment of the hash table. A simple
mechanism prevents oscillation between two adjacent
hash table sizes. In the next section we will discuss
this technique in more detail.
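
The expansion and contraction procedure can be sketched as follows. Only the 80% bound comes from the text; the 20% lower bound, the doubling and halving of the modulus, the minimum size, the use of the next-empty-place discipline, and Python's built-in hash() in place of radix conversion are all assumptions of this sketch. The gap between the two bounds is one possible reading of the "simple mechanism" that prevents oscillation.

```python
class ScatterTable:
    """Open-addressing table that rehashes when too full or too sparse."""
    GROW_AT, SHRINK_AT, MIN_SIZE = 0.8, 0.2, 8

    def __init__(self):
        self.size = self.MIN_SIZE           # the modulus
        self.slots = [None] * self.size
        self.count = 0

    def _probe(self, key):
        """Index of key's slot, or of the empty slot where it would go."""
        i = hash(key) % self.size
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size         # next empty place
        return i

    def _put(self, key, pointer):
        i = self._probe(key)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, pointer)

    def _rehash(self, new_size):
        """Change the modulus and re-enter every current entry."""
        entries = [e for e in self.slots if e is not None]
        self.size, self.slots, self.count = new_size, [None] * new_size, 0
        for key, pointer in entries:
            self._put(key, pointer)

    def add(self, key, pointer):
        self._put(key, pointer)
        if self.count > self.GROW_AT * self.size:
            self._rehash(self.size * 2)

    def decode(self, key):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry else None

    def delete(self, key):
        i = self._probe(key)
        if self.slots[i] is None:
            return                          # key was not in the table
        self.slots[i] = None
        self.count -= 1
        # Re-enter the rest of the run so later probes still find it.
        j = (i + 1) % self.size
        while self.slots[j] is not None:
            k, p = self.slots[j]
            self.slots[j] = None
            self.count -= 1
            self._put(k, p)
            j = (j + 1) % self.size
        if self.size > self.MIN_SIZE and self.count < self.SHRINK_AT * self.size:
            self._rehash(self.size // 2)
```

Because the table is at most 40% full just after either kind of rehash, a single addition or deletion near a threshold cannot trigger a rehash in the opposite direction.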
XXVIII.  Hashing and Secondary Storage

This section is concerned with the use of scatter storage,
or hash techniques to retrieve information from large
data bases. We will draw a distinction between scatter
index tables, where each entry contains a pointer to the
desired item, and scatter storage tables, where each
entry is the desired item. We are particularly interested
in the case where both the data base and a scatter index
table giving access to the data base are so large that
they cannot be contained in core. This situation commonly
occurs in the environment of a time sharing system on a
paged machine with a disc or drum secondary store. In
such an environment the entire data base would reside in
secondary; a scatter index table would primarily reside
in secondary but it would be possible, in general, to
lock a few pages in core. We will be concerned with two
formats for the data base: fixed length and variable,
unbounded length. Since disc access times and transfer
times are typically 4 to 5 orders of magnitude greater
than memory cycle times we are primarily concerned with
minimizing the number of disc accesses so as to both
increase throughput rate of the information retrieval
system and decrease response times of individual requests.


First, let us assume that the entries of the data base
are of fixed length. Two widely divergent methods of
access are applicable here. One is to construct a scatter
index to secondary storage; the other is to organize the
data base as a scatter storage table and access the data
base directly.

Let us first consider the use of a scatter index table.
Since accesses are made randomly there is no special
advantage in locking any particular page in core; hence
we suppose that the entire table is in secondary storage.

If we use either random probing or chaining as a
collision-resolving discipline then we can expect the (i+1)st
probe to be in the same page as the i-th probe with
probability $1/2^m$, where $2^m$ is the number of pages that
the table occupies. We must also allow one probe to get
the item from the data base after its index is learned.
The expected number of disc accesses for each discipline
is therefore:

random probing    $E_{rp}$ = [formula not legible in the scan;
                  only the fragment $-\ln(1-a)$ survives]

chaining          $E_{ch}$ = [formula not legible in the scan]

where a is the table density.
If we use the next empty place method of resolving
collisions we only access additional pages when the first
probe lies near the end of a page. (Since even for a
table density of .99 the expected number of probes is
50.5 and since a typical page size is 512 words we need
not consider the possibility of accessing more than two
index pages in order to calculate the expected number of
accesses.)

next empty place

$$E_{ne} = 2 + \frac{1}{2^n}\left[\frac{1}{2}\left(1 + \frac{1}{1-a}\right) - 1\right] = 2 + \frac{1}{2^{n+1}}\left(\frac{a}{1-a}\right)$$

where $2^n$ is the page size.

Assuming an index of 256 pages of 512 words each we get
the following table:

                          Table 1

a                     .5      .6      .7      .8      .9

random probing        2.39    2.53    2.72    3.01    3.56

chaining              2.25    2.30    2.35    2.40    2.45

next empty place      2.00    2.00    2.00    2.01    2.01

From the table we can observe that even though the next
empty place method is computationally the least efficient
it is nevertheless the most suitable as far as disc
accesses are concerned.
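
The entries of Table 1 can be approximated from commonly used estimates of the expected number of probes. The formulas in this sketch are assumptions, not a transcription of the report's own expressions: random probing is taken as -(1/a) ln(1-a) probes, chaining as 1 + a/2, and next empty place as (1/2)(1 + 1/(1-a)), which does give the 50.5 probes quoted for a = .99. For the first two disciplines every probe is charged one disc access, neglecting the small chance that consecutive probes share a page.

```python
import math

def probes_random(a):
    """Expected probes, random probing (standard estimate, assumed here)."""
    return -math.log(1 - a) / a

def probes_chaining(a):
    """Expected probes, chaining (standard estimate, assumed here)."""
    return 1 + a / 2

def probes_next_empty(a):
    """Expected probes, next empty place."""
    return 0.5 * (1 + 1 / (1 - a))

def accesses_scattered(probes):
    """One access per index probe plus one for the data base item."""
    return probes + 1

def accesses_next_empty(a, page_words=512):
    """Two accesses, plus a third when the probe run crosses a page end."""
    return 2 + (probes_next_empty(a) - 1) / page_words

for a in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(a,
          round(accesses_scattered(probes_random(a)), 2),
          round(accesses_scattered(probes_chaining(a)), 2),
          round(accesses_next_empty(a), 2))
```

Under these assumptions the computed figures agree with the tabulated ones to within about 0.01, which suggests, but does not establish, that the table was derived along similar lines.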

We  now  propose  a  variation  of  the  scatter  index  table 
which  is  computationally  efficient  and  which  requires  no 
more  disc  accesses  than  the  next  empty  place  method. 

This  method  uses  a  computed  hash  code  of  m+n  bits. 

The first m bits are used to index a locked-in table
of $2^m$ entries, each of which is a page address. The
addressed page is brought into core, and the low order
n bits are used as a key to locate the entry in this
page. Any one of the three collision-resolving methods
may be used. We emphasize that, whichever method is
chosen, it is only applied within the one page (e.g.,
the pseudo-random number generator used in the random
probing method generates integers between 1 and $2^n$).
Obviously, this method will not work if more than $2^n$
entries map into the same page. We now show that for
acceptable  table  densities  this  overflow  condition  is  so 
improbable  that  it  may  be  safely  ignored.  Later  we  will 
present  a  method  of  avoiding  overflow  entirely. 
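The two-level lookup just described can be sketched as follows. This is only an illustration under stated assumptions: `zlib.crc32` stands in for the unspecified hash function, an in-memory list of lists stands in for the locked-in page table and the disc pages, and next empty place (wrapping inside the page) is chosen as the collision rule. The class and method names are hypothetical.

```python
import zlib

def hash_code(key, bits):
    """Stand-in hash: the low `bits` bits of CRC-32 (an assumption)."""
    return zlib.crc32(key.encode("utf-8")) & ((1 << bits) - 1)

class PagedScatterIndex:
    def __init__(self, m=8, n=9):
        self.m, self.n = m, n
        # One list of 2**n places per page; 2**m pages in all.
        self.pages = [[None] * (1 << n) for _ in range(1 << m)]

    def _locate(self, key):
        code = hash_code(key, self.m + self.n)
        page_no = code >> self.n                # first m bits pick the page
        start = code & ((1 << self.n) - 1)      # low-order n bits: in-page key
        return self.pages[page_no], start

    def insert(self, key, db_address):
        page, start = self._locate(key)
        size = len(page)
        for step in range(size):
            slot = (start + step) % size        # never leaves the page
            if page[slot] is None or page[slot][0] == key:
                page[slot] = (key, db_address)
                return
        raise OverflowError("more than 2**n entries map into this page")

    def lookup(self, key):
        page, start = self._locate(key)
        size = len(page)
        for step in range(size):
            entry = page[(start + step) % size]
            if entry is None:
                return None                     # empty place ends the search
            if entry[0] == key:
                return entry[1]
        return None
```

Because collision resolution wraps within a single page, a lookup touches exactly one index page; the `OverflowError` branch is the overflow condition whose probability is estimated next.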

Since hash addresses are assumed to be computed randomly,
the probability of an entry being mapped into a given
page is $1/2^m$. We have, in fact, a binomial distribution.
Writing $p_k$ for the probability of k entries mapping
into the same page, we have

$$p_k = \binom{aN}{k}\left(\frac{1}{2^m}\right)^{k}\left(1 - \frac{1}{2^m}\right)^{aN-k}$$

where $N = 2^{m+n}$ = number of places in the table, and
a, as usual, is the table density.


Now

$$\Pr[\text{overflow}] = 1 - \Pr[\text{no overflow}] = 1 - \sum_{k=0}^{2^n} p_k \tag{1}$$


We note that $aN/2^m$ is large in general, so that
the Poisson approximation is not applicable. However,
$aN\,\frac{1}{2^m}\left(1 - \frac{1}{2^m}\right)$ is of reasonable magnitude (e.g., for the
case of 256 512-word pages and $a = \frac{1}{2}$, it is 255), and
so we may employ the normal approximation. We then have¹

$$\Pr[\text{overflow}] \approx 1 - \left[\Phi\left(x_{2^n+\frac{1}{2}}\right) - \Phi\left(x_{-\frac{1}{2}}\right)\right] \tag{2}$$

where $x_t = (t - a2^n)h$ and

$$h = \left[a2^n\left(1 - \frac{1}{2^m}\right)\right]^{-\frac{1}{2}}$$

¹W. Feller, An Introduction to Probability Theory
and Its Applications, Vol. 1, page 172.
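The normal approximation to the overflow probability is easy to evaluate numerically. The sketch below assumes the continuity-corrected form 1 - [Φ(x_hi) - Φ(x_lo)] with x_t = (t - a·2^n)·h and h = [a·2^n(1 - 2^-m)]^(-1/2); the function name and defaults (256 pages of 512 words) are illustrative. Upper tails are computed with `math.erfc` so that very small probabilities are not lost to rounding.

```python
import math

def pr_overflow(a, m=8, n=9):
    """Normal approximation to Pr[more than 2**n entries map into a page]."""
    mean = a * 2**n                                   # expected entries per page
    h = 1.0 / math.sqrt(mean * (1.0 - 2.0**-m))       # 1 / standard deviation
    x_hi = (2**n + 0.5 - mean) * h
    x_lo = (-0.5 - mean) * h
    upper_tail = 0.5 * math.erfc(x_hi / math.sqrt(2.0))    # 1 - Phi(x_hi)
    lower_tail = 0.5 * math.erfc(-x_lo / math.sqrt(2.0))   # Phi(x_lo)
    return upper_tail + lower_tail

if __name__ == "__main__":
    for a in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(a, pr_overflow(a))
```

Under these assumptions the probability is of order 10^-16 at a = .7 and rises to roughly .008 at a = .9, which is the behaviour the argument relies on.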

We present a table of these probabilities for the case
of 256 512-word pages with various values of a.

Table 2

a               .5            .6            .7             .8            .9

Pr[overflow]    <8×10^(-…)    <8×10^(-…)    1.8×10^(-16)   1.6×10^(-7)   .008

We observe that for a ≤ .7, it is so improbable that
overflow occurs that the possibility may safely be ignored.
Note that with this method exactly two accesses are
required, one to retrieve a page of index table and one
to retrieve a page from the data base.

We now discuss a method which eliminates the possibility
of overflow. Even in cases of high table density, overflow
is so improbable that relatively few pages are
involved. We can eliminate overflow by splitting any
page with $2^n$ entries into two parts and writing one
part on a new page. An obvious way to do this would be
to compute originally a hash code of n+m+r bits, saving
the r extra bits with each entry. Then, when a page
is split into two pages the low order bit is examined
to  determine  in  which  of  the  two  pages  the  entry  is  to 
go.  Thereafter  m+1  bits  are  necessary  to  determine 
which  index  page  should  be  brought  in  from  disc.  Since 
m  bits  are  used  ordinarily  the  problem  is  to  determine 
when  m  bits  are  insufficient.  This  is  a  table  look¬ 
up  problem  amenable  to  scatter  storage  techniques.  We 
create  a  small  locked-in  hash  table  (e.g.,  64  entries 
for  a  table  of  256  pages)  using  some  of  the  m  bits 
used  to  identify  a  page  as  a  key.  (Notation:  we  call 
this  table  a  cluster-buster  table.)  Entries  are  made 
in the table whenever a page is split, each entry being
the address of the new page which now contains part of
the old page. When adding a new item or looking up an
old one, reference is first made to the cluster-buster
table. If no entry is found, then the page has never
been split and its address can be found in the locked-in
index table. If an entry is found, then the (m+1)st
bit of the key is examined. If the bit is 0 then the
first m bits of the key are an index to locate the page
address  in  the  locked-in  index  table.  If  the  bit  is  1, 
then  the  small  hash  table  entry  contains  the  page  address. 
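The page-selection rule for a once-split page can be sketched in a few lines. This is an assumption-laden illustration, not the report's implementation: a Python dict keyed by the m-bit page number stands in for the small locked-in cluster-buster hash table, a list stands in for the locked-in index table, and the names `page_address` and `record_split` are hypothetical.

```python
def page_address(code, total_bits, m, index_table, cluster_buster):
    """Choose the index page for a hash code of `total_bits` bits."""
    page_no = code >> (total_bits - m)              # first m bits
    split_address = cluster_buster.get(page_no)
    if split_address is None:
        return index_table[page_no]                 # page has never been split
    next_bit = (code >> (total_bits - m - 1)) & 1   # the (m+1)st bit
    return index_table[page_no] if next_bit == 0 else split_address

def record_split(page_no, new_page_address, cluster_buster):
    """Enter the address of the new page created when `page_no` is split."""
    cluster_buster[page_no] = new_page_address

def undo_split(page_no, cluster_buster):
    """Recombination: drop the entry once the two halves are merged again."""
    cluster_buster.pop(page_no, None)
```

For example, with m = 2 and 4-bit codes, code 0b1011 selects page 2; after `record_split(2, "P2b", cb)` the same code is routed to "P2b" because its third bit is 1, while 0b1001 still goes to the original page.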


Note that this procedure is reversible. That is, if
sufficiently many deletions are made in the two halves of
a split page, then the two halves may be recombined into
one page and the entry deleted from the cluster-buster
table. Note also that this procedure, with some
modification, may be applied to previously split pages
which have again filled up. Hence it is possible to have
pages in the index which have split several times, whereas
other pages are only partially full. (In order to implement
multiple splitting the following change must be
made:  the  entry  in  the  cluster-buster  table  contains 
(1)  the  number  of  additional  bits  needed  to  identify  all 
the  pages  into  which  the  original  page  was  split;  e.g. , 
if  a  page  is  split  into  four  pages  then  two  additional 
bits,  m+2  bits  in  all,  are  needed,  and  (2)  the  address 
of  a  locked-in  auxiliary  table,  indexed  by  the  additional 
bits,  which  contains  the  disc  addresses  of  each  page.) 
These two features (i.e., recombination and multiple
splitting) allow a single index table to accommodate a
varying skewed data base. In the event that additional
locked-in  space  in  core  is  available  and  that  it  is  not 
expected  that  a  page  will  ever  have  to  be  split  more  than 
once,  the  cluster-buster  hash  table  can  be  replaced  by  an 
indexed  table;  this  would  use  space  inefficiently,  but  the 
time  advantage  of  indexing  as  opposed  to  probing  a  hash 
table might justify the waste. For a data base of
indeterminate size the cluster-buster technique can be used
to trigger an overall system expansion. When the
cluster-buster table has more than a certain percentage of entries
the entire system can be doubled. Those pages which have
already  split  will  not  be  split  again  and  all  entries  for 
pages  which  have  split  exactly  once  can  be  deleted  from 
the  cluster-buster  table. 

Note  that  use  of  the  cluster-buster  table,  just  as 
ignoring  improbable  overflow,  guarantees  that  precisely 
two  disc  accesses  are  required.  We  now  investigate  the 
disc  access  efficiency  of  each  method.  Since  the  collision 
resolving  method  within  each  page  of  index  table  does  not 
affect  the  probability  of  overflowing  that  table,  there 
is  no  reason  not  to  use  the  most  efficient  method  — 
chaining.  Similarly  there  is  no  reason  not  to  use 
chaining  within  the  cluster-buster  table.  Disc  access 
cost  is  a  function  of  the  number  of  probes  required  to 
access an item. If k items are mapped into a page,
then the expected number of probes to access an item on
that page is $1 + k/2^{n+1}$. We have previously calculated
that $p_k$, the probability that k items are mapped into
a page, obeys the binomial distribution. Hence, ignoring
overflow

$$E_{probe} = \sum_{k=0}^{2^n} p_k\left(1 + \frac{k}{2^{n+1}}\right)$$

$$\le \sum_{k=0}^{aN} p_k\left(1 + \frac{k}{2^{n+1}}\right) = \sum_{k=0}^{aN} p_k + \frac{1}{2^{n+1}}\sum_{k=0}^{aN} k\,p_k$$

$$= 1 + \frac{aN}{2^{n+1}\cdot 2^m} = 1 + \frac{a}{2}$$


The  analysis  of  the  cluster-buster  case  is  somewhat  more 
complicated.  We  will  not  consider  the  case  where  it  is 
necessary  to  split  a  page  more  than  once.  The  probability 
of  such  an  event  is  so  remote  that,  should  it  occur,  it 
can  be  argued  that  the  hash  function  is  not  acting  randomly, 
in  which  event  none  of  this  analysis  is  valid  anyway.  We 
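write $p'_k$ for the probability that when a page is split
into pages x and y, then page x has exactly k
elements. We have

$$p'_k = \binom{2^n}{k}\left(\frac{1}{2}\right)^{2^n}$$

since entries from the split page go into page x with
probability $\frac{1}{2}$. The expected number of probes in a page
with k elements is $1 + k/2^{n+1}$. Hence

$$\frac{1}{2}\left(1 + \frac{k}{2^{n+1}}\right) + \frac{1}{2}\left(1 + \frac{2^n - k}{2^{n+1}}\right)$$

is the expected number of probes to access an element
which maps into a split page. Therefore the total
contribution from all split pages is

$$E_1 = \Pr[\text{overflow}]\sum_{k=0}^{2^n} p'_k\left[\frac{1}{2}\left(1 + \frac{k}{2^{n+1}}\right) + \frac{1}{2}\left(1 + \frac{2^n - k}{2^{n+1}}\right)\right]$$

$$= \Pr[\text{overflow}]\cdot\frac{5}{4}\sum_{k=0}^{2^n} p'_k = \frac{5}{4}\Pr[\text{overflow}]$$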

The contribution from the unsplit pages (i.e., the pages
where overflow has not occurred) is calculated as above.

$$E_2 = (1 - \Pr[\text{overflow}])\sum_{k=0}^{2^n} p_k\left(1 + \frac{k}{2^{n+1}}\right)$$

$$= (1 - \Pr[\text{overflow}])\left(\sum_{k=0}^{aN} p_k\left(1 + \frac{k}{2^{n+1}}\right) - \sum_{k=2^n+1}^{aN} p_k\left(1 + \frac{k}{2^{n+1}}\right)\right)$$

$$\le (1 - \Pr[\text{overflow}])\left(\sum_{k=0}^{aN} p_k + \frac{1}{2^{n+1}}\sum_{k=0}^{aN} k\,p_k - \frac{3}{2}\sum_{k=2^n+1}^{aN} p_k\right)$$

$$= (1 - \Pr[\text{overflow}])\left(1 + \frac{a}{2} - \frac{3}{2}\Pr[\text{overflow}]\right)$$

Hence

$$E_{probe} = E_1 + E_2 \le 1 + \frac{a}{2} - \Pr[\text{overflow}]\left(\frac{5}{4} + \frac{a}{2} - \frac{3}{2}\Pr[\text{overflow}]\right)$$

Figure XXVIII-1 is a graph of the term in parentheses
as  a  function  of  a  (using  the  normal  approximation  to 
calculate  Pr [overflow]  and  assuming  a  table  of  256 
512-word  pages).  Note  that  this  term  is  always  positive. 
We  see  that  if  we  either  ignore  improbable  overflow  or 
use  a  cluster-buster  table,  we  require  fewer  probes  than 
the  most  efficient  collision-resolving  method.  To  offset 
this  saving  we  have  for  one  method  the  possibility 
(admittedly  improbable)  of  system  blow-up  because  of 
overflow  and  for  the  other  the  expense  of  using  the 
cluster-buster  table.  In  either  case  we  have  the  expense 
of accessing an indexed table.

In order to more readily compare all these methods we
shall calculate computational costs (as opposed to costs
of disc access which have already been presented). We
define primitive costs as follows:

$c$      = cost of compare
$c_h$    = cost of computing a hash address ≈ 4c
$c_x$    = cost of accessing a table through an
           index register ≈ c
$c_r$    = cost of pseudo-random number generator ≈ 2c
$c_{ch}$ = cost of following a chain ≈ 3c
$c_i$    = cost of incrementing an address ≈ c/2

Each of the following costs falls naturally into two
parts: (1) the cost of choosing a disc page to bring
into core; (2) the cost of locating an item within that
page. The second of these costs is enclosed within { } in
the following formulae.

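$c_1$ = cost of random probe scatter index

$$c_1 = c_h + \left\{c_x + c + (c_r + c_x + c)\left(-\frac{1}{a}\ln(1-a) - 1\right)\right\} \approx 2c - 4c\left(\frac{1}{a}\right)\ln(1-a)$$

$c_2$ = cost of chained scatter index

$$c_2 = c_h + \left\{c_x + c + (c_{ch} + c)\frac{a}{2}\right\} \approx 6c + 2ca$$

$c_3$ = cost of next empty place scatter index

$$c_3 = c_h + \left\{c_x + c + (c_i + c)\frac{1}{2}\left(\frac{a}{1-a}\right)\right\} \approx 6c + \frac{3c}{4}\left(\frac{a}{1-a}\right)$$

$c_4$ = cost of paged scatter index ignoring overflow

$$c_4 = c_h + c_x + \left\{c_x + c + (c_{ch} + c)\frac{a}{2}\right\} \approx 7c + 2ca$$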


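$c_5$ = cost of paged scatter index using hashed
cluster-buster filled to density $\frac{1}{2}$

$$c_5 \le c_h + c_h + c + \Pr[\text{overflow}]\left[(c_{ch} + c)\frac{1}{4} + c + \frac{1}{2}c_x\right] + (1 - \Pr[\text{overflow}])\,\ldots$$

$$\quad + \left\{c_x + c + (c_{ch} + \ldots)\left[\frac{a}{2} - \Pr[\text{overflow}]\left(\frac{5}{4} + \frac{a}{2} - \frac{3}{2}\Pr[\text{overflow}]\right)\right]\right\}$$

$$\approx 9c + 3ca + c\,\Pr[\text{overflow}]\,(9\Pr[\text{overflow}] - 3a - \ldots)$$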


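$c_6$ = cost of paged scatter index using indexed
cluster-buster

$$c_6 \le c_h + c_x + c + \Pr[\text{overflow}]\left(c + \frac{1}{2}c_x\right) + (1 - \Pr[\text{overflow}])\,c_x$$

$$\quad + \left\{c_x + c + (c_{ch} + c)\left[\frac{a}{2} - \Pr[\text{overflow}]\left(\frac{5}{4} + \frac{a}{2} - \frac{3}{2}\Pr[\text{overflow}]\right)\right]\right\}$$

$$\approx 9c + 2ca + c\,\Pr[\text{overflow}]\left(6\Pr[\text{overflow}] - 2a - \frac{9}{2}\right)$$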


These  formulae  are,  of  course,  only  approximations  and  can 
easily  be  obtained  by  considering  the  operations  that  are 
necessary  to  access  an  entry  by  each  method.  Note  that 
they  compute  the  cost  of  accessing  an  entry  which  is  in 
the  table.  The  cost  of  trying  to  access  an  item  not  in 
the  table  is  higher.  The  following  table  gives  these 
costs as a function of a for the case of a table of
256 512-word pages.

Table 3

a          .5      .6      .7      .8      .9

c_1/c      7.54    8.11    8.88   10.05   12.23
c_2/c      7.00    7.20    7.40    7.60    7.80
c_3/c      6.75    7.13    7.75    9.00   12.75
c_4/c      8.00    8.20    8.40    8.60    8.80
c_5/c     10.50   10.80   11.10   11.40   11.62
c_6/c     10.00   10.20   10.40   10.60   10.75
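The first four rows of the table can be reproduced from closed forms expressed in units of c. The sketch below assumes the primitive costs c_h ≈ 4c, c_x ≈ c, c_r ≈ 2c, c_ch ≈ 3c, and additionally assumes c_i ≈ c/2, a value inferred from the next empty place row rather than stated legibly in the text. The function names are illustrative, and the cluster-buster rows are omitted because they also depend on Pr[overflow].

```python
import math

def random_probe_cost(a):
    """c_1 / c: 2 - (4/a) ln(1 - a)."""
    return 2.0 - 4.0 * math.log(1.0 - a) / a

def chained_cost(a):
    """c_2 / c: 6 + 2a."""
    return 6.0 + 2.0 * a

def next_empty_place_cost(a):
    """c_3 / c: 6 + (3/4) a / (1 - a), assuming c_i = c/2."""
    return 6.0 + 0.75 * a / (1.0 - a)

def paged_ignoring_overflow_cost(a):
    """c_4 / c: 7 + 2a."""
    return 7.0 + 2.0 * a

if __name__ == "__main__":
    for a in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(a,
              round(random_probe_cost(a), 2),
              round(chained_cost(a), 2),
              round(next_empty_place_cost(a), 2),
              round(paged_ignoring_overflow_cost(a), 2))
```

The values agree with the tabulated figures to within about 0.01, the residual difference being a matter of rounding.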

In most systems, however, disc access time is of greater
concern than computational cost. In that case, ignoring
overflow or using a cluster-buster is the best method,
with the next empty place method close behind. We now
consider the costs involved in organizing the data base
as a hash table and accessing items directly. Items in
the data base are assumed to occupy $2^r$ words, $0 \le r \le n$.

We require $2^{m+r}$ pages to store as many items in a
scatter storage table as could be accessed by $2^m$ pages
of a scatter index table of the same density. We note
immediately that only values of a close to 1 can be
considered, unless we are willing to waste huge amounts
of secondary storage (e.g., for r = 4, m = 8, n = 9,
and a = .8, we waste approximately 51.2 pages (27K words)
using a scatter index table and 819.2 pages (410K words)
using a scatter storage table. If items are larger, and
in real use (e.g., inverted lists in an information
retrieval system) they often will be, the waste in a
scatter storage table at any density not close to 1
becomes intolerable.) If we use either random probing
or chaining, the probability of the (i+1)st probe being
in the same page as the ith probe is $1/2^{m+r}$. Hence

the expected number of accesses required is:

random probing

$$E_{rp} = 1 + \frac{2^{m+r} - 1}{2^{m+r}}\left(-\frac{1}{a}\ln(1-a) - 1\right)$$

chaining

$$E_{ch} = 1 + \frac{2^{m+r} - 1}{2^{m+r}}\cdot\frac{a}{2}$$

If the next empty place method is used we must use a
more careful analysis. Let $E = \frac{1}{2}\left(1 + \frac{1}{1-a}\right)$. Then

$$1 + \left\lfloor\frac{E-1}{2^{n-r}}\right\rfloor$$

is the minimum number of disc accesses required. Now

$$(E-1) - 2^{n-r}\left\lfloor\frac{E-1}{2^{n-r}}\right\rfloor$$

is the remainder of E-1 divided by $2^{n-r}$, and hence

$$\frac{(E-1) - 2^{n-r}\left\lfloor\frac{E-1}{2^{n-r}}\right\rfloor}{2^{n-r}}$$

is the probability that
this remainder will overflow a page boundary, i.e., the
probability that 1 more than the minimum number of disc
accesses will be required. Hence

$$E_{ne} = 1 + \frac{E-1}{2^{n-r}} = 1 + \frac{1}{2^{n-r+1}}\left(\frac{a}{1-a}\right)$$

Note that this is precisely the same formula as derived
above where we discounted the possibility of a search
going over more than one page boundary.

We now present a table of these expected numbers of
accesses for the case of m = 8, n = 9, a = .9, and
r = 4, 5, 6, 7.

Table 4

r          4      5      6      7

E_rp       2.56   2.56   2.56   2.56
E_ch       1.45   1.45   1.45   1.45
E_ne       1.14   1.28   1.56   2.13
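The entries of this table can be checked with a short sketch. It assumes that, for random probing and chaining, successive probes share a page with probability 1/2^(m+r), and that a next-empty-place search of E - 1 extra probes crosses a page boundary with probability (E - 1)/2^(n-r), where E = (1/2)(1 + 1/(1 - a)). The function name and its return layout are illustrative.

```python
import math

def scatter_storage_accesses(a, m, n, r):
    """Expected disc accesses when the data base itself is the hash table.

    Items occupy 2**r words, pages hold 2**n words, and the table
    spans 2**(m + r) pages (assumptions taken from the text).
    """
    other_page = (2**(m + r) - 1) / 2**(m + r)
    extra_rp = -math.log(1.0 - a) / a - 1.0      # extra probes, random probing
    extra_ch = a / 2.0                           # extra probes, chaining
    extra_ne = 0.5 * a / (1.0 - a)               # E - 1 for next empty place
    items_per_page = 2**(n - r)
    return {
        "rp": 1.0 + other_page * extra_rp,
        "ch": 1.0 + other_page * extra_ch,
        "ne": 1.0 + extra_ne / items_per_page,
    }

if __name__ == "__main__":
    for r in (4, 5, 6, 7):
        print(r, scatter_storage_accesses(0.9, 8, 9, r))
```

The sketch shows why the next empty place figure degrades as r grows: fewer items fit on a page, so a run of consecutive probes is more likely to spill onto the next page.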

From Tables 2 and 4 we observe that if entries in the
data base are of a fixed length of 32 words or less, and
if it is possible to waste a substantial piece of
secondary storage (for the case r = 4, 205K words are
wasted), organizing the data base as a scatter storage
table using the next empty place method of collision
resolution is most efficient with respect to number of
disc accesses required. If it is not possible to waste
as much space, or if entries are larger than 32 words,
then organizing the data base as a scatter storage table
using chaining is most efficient. (If a = .99 the
wasted space drops to 20K words while $E_{ch}$ = 1.50.)

We  must  note  that  organizing  the  data  base  as  a  scatter 
storage  table  is  possible  only  if  entries  are  of  fixed 
length.  In  general,  unfortunately,  entries  are  of 
variable  unbounded  length,  e.g.,  text,  inverted  lists, 
item-sequenced lists, etc. In that case, of the methods
discussed so far, ignoring overflow or using a cluster-buster
table are most efficient with respect to number of
disc accesses, each method requiring exactly 2.

We now discuss a method which combines features of both
scatter storage and index tables and which reduces the
number of disc accesses. We retain a scatter index table,
but instead of using the entire page for hash table, we
devote a portion to entries from the data base. Since
the hash segments are small, we can expect overflow to
be fairly probable, so it cannot be ignored; therefore
we use a cluster-buster table to prevent overflow. It
is possible to describe an organization where the proportion
of each page devoted to data base items varies
from page to page. However, in this discussion we shall
only consider the case where this proportion is fixed.

A hash table entry will either point to an entry in the
page or will be the disc address of the entry (again,
other organizations are possible, but this is the only
one we discuss here).

The  proportion  of  each  page  devoted  to  data  base  entries 
and  the  density  of  the  scatter  index  table  depend  on 
several  conflicting  criteria.  One,  it  is  desirable  that 
as  little  data  base  storage  space  as  possible  be  wasted. 
Hence  it  is  desirable  that  there  be  sufficiently  many 
entries  in  the  hash  table  in  each  page  so  that  the 
remainder  of  the  page  be  filled;  i.e.,  either  a  should 
be large, or the hash table section large, or both. Two,
since  disc  accesses  are  expensive,  as  many  items  as 
possible  should  be  in  the  page:  i.e.,  the  hash  table 
section  should  be  small.  Three,  core  space  is  expensive 
and  the  cluster-buster  resides  in  core.  Hence,  as  few 
segments  as  possible  of  hash  table  should  overflow;  i.e., 
either  a  should  be  small,  or  the  hash  table  section 
large, or both.

We now present formulae for these criteria.

S = size of cluster-buster table

$$S \approx 2\Pr[\text{overflow}]\,\frac{N}{T}$$

where Pr[overflow] is given exactly by (1) and
approximately by (2) above, and where

N = number of places in scatter index table
T = number of places in each segment of table

W = space wasted in mixed pages

$$W = \left[(2^n - T) - aT\bar{\ell}\right]\frac{N}{T}$$

$\bar{\ell} = \sum_{i=0}^{\infty} i\,p_i$ = expected length of each
data base entry

$p_i$ = probability that a data base entry is i
words long

E = expected number of disc accesses required

$$E = 1 + \left[1 - \frac{2^n - T}{aT\bar{\ell}}\right] = 2 - \frac{2^n - T}{aT\bar{\ell}}$$

In  sum,  we  have  discussed  various  solutions  to  the 
problem of accessing a data file too large to fit into
core. If the file consists of fixed size items, then
organizing  the  data  as  a  hash  storage  table  and 
addressing  it  directly  is  most  economical  with  regard 
to  disc  accesses.  If  the  file  consists  of  variable 
length  items,  then  use  of  a  mixed  scatter  index  table 
with  an  auxiliary  cluster-buster  table  is  most  economical. 


XXIX.  Net  Models  —  Some 

Elementary Constructs

In previous sections we have been satisfied with an
informal  definition  of  batching  and  buffering,  and  we 
have  ignored  the  general  question  of  concurrency  — 
despite  the  fact  that  many  of  the  systems  examined  have 
involved  concurrent  operation.  For  example,  when  a 
subdeck  of  feature  cards  is  placed  in  front  of  a  light 
source,  the  "hits"  are  available  concurrently;  the 
intersection  operation  occurs  concurrently  for  all  card- 
positions. We will introduce Petri nets as a representational
medium for exhibiting concurrency. A brief
description  of  Petri  nets  and  occurrence  systems  is 
provided  in  Appendix  I. 

Let  us  begin  our  discussion  by  considering  a  simple 
system  consisting  of  four  operations.  When  the  first 
operation  is  completed,  the  second  operation  begins; 
when  the  second  is  completed,  the  third  begins;  when  the 
third is completed, the fourth begins. These four
operations, thus constrained, are repeated cyclically.

We represent this system with the net in Figure XXIX-1.
The sequencing constraints seem to preclude performing
any of the operations concurrently. If this were the
case, the time required for each iteration of the cycle
would be equal to the sum of the durations of the four
operations.

Figure XXIX-1

O1  : operation 1 in progress
O2  : operation 2 in progress
O3  : operation 3 in progress
O4  : operation 4 in progress
O1' : operation 1 not in progress
      (i.e., completed; not yet rebegun)
O2' : operation 2 not in progress
O3' : operation 3 not in progress
O4' : operation 4 not in progress

However, Figure XXIX-2, which is a repetition stretch representing
four iterations of the system's behavior-cycle, exhibits
the fact that all four operations may be performed concurrently.

Figure XXIX-2

The net, which represents explicitly both the sequencing
constraints and the cyclic behavior of the system, exhibits
concurrencies among operations from (what we think
of as) different iterations of its behavior-cycle. The
dotted line in Figure XXIX-2 indicates a time slice in
which the nth iteration of operation 1, the (n-1)st
iteration of operation 2, the (n-2)nd iteration of
operation 3, and the (n-3)rd iteration of operation 4
are concurrent.
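The staggering of iterations across the four operations can be illustrated with a small schedule generator. The sketch assumes, purely for illustration, that every operation takes one time unit and that an operation starts as soon as its predecessor has finished the same iteration; the function name is hypothetical.

```python
def pipeline_schedule(stages, iterations):
    """For each time slice, list the iteration each stage is working on.

    Entry s of a slice is the (0-based) iteration occupying operation
    s + 1, or None when that operation is idle. Unit durations are an
    assumption made for this sketch.
    """
    slices = []
    for t in range(iterations + stages - 1):
        slices.append([t - s if 0 <= t - s < iterations else None
                       for s in range(stages)])
    return slices

if __name__ == "__main__":
    for t, occupancy in enumerate(pipeline_schedule(4, 4)):
        print(t, occupancy)
```

In the slice at t = 3 the four operations hold iterations 3, 2, 1 and 0 respectively, which is the situation marked by the dotted line: iteration n of operation 1 runs alongside iteration n-1 of operation 2, and so on.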

We  shall  call  the  net  structure  in  Figure  XXIX-1  a 
pipeline.  The  pipeline  in  Figure  XXIX-1  has  four  stages; 
accordingly,  we  would  describe  it  as  a  pipeline  with 
capacity  4.  We  can  view  a  pipeline  variously  as  a  set  of 
ordered  operations  (as  in  the  example  above)  or  as  a 
buffer  or  stack  into  which  values  can  be  placed.  In  the 
latter  interpretation  each  "stage"  or  "pair  of  places" 
might  be  viewed  as  a  storage  cell  capable  of  storing  one 
value.  We  may  think  of  values  as  being  "dropped  in  at 
the  top"  and  transmitted  "down"  the  pipeline.  A  pipeline 
of  capacity  n  will  be  capable  of  holding  n  values 
concurrently.  Suppose  we  were  dealing  with  two  different 
types  of  value  and  we  wished  to  distinguish  between  them. 
We  could  construct  a  bi-valued  pipeline,  or  bit  channel, 
as in Figure XXIX-3.

Figure XXIX-3

Each stage now has three possible (mutually exclusive)
states: "empty", "1", or "0". Note that with such a
structure we can transmit a sequence of bits, maintaining
the order of the values.

XXX. A Model of Buffering


transition 1 : initiation of the process
transition 2 : termination of the process
transition 3 : initiation of buffered I/O
transition 4 : termination of buffered I/O

place a : process in progress
place b : process not in progress
place c : I/O in progress
place d : I/O not in progress
place e : input for process available
place f : output of process available for I/O


The occurrence graph below is based on an initial case in
which both the process and the I/O are idle and the
"input-buffer" of the process is full. Note that in
this system the process and the I/O will never be
concurrent. Thus, if a and c are the only places
of significant duration, the minimum time for a cycle
in this system is equal to a+c, i.e., processing
time + I/O time.
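The single-buffer behaviour can be replayed as a token game. The net figure is not reproduced here, so the arc structure below is an assumption reconstructed from the place and transition descriptions: transition 1 consumes b and e and marks a; transition 2 consumes a and marks b and f; transition 3 consumes d and f and marks c; transition 4 consumes c and marks d and e. Markings are sets of marked places, which presumes at most one token per place.

```python
# Assumed arcs: transition -> (input places, output places).
TRANSITIONS = {
    1: ({"b", "e"}, {"a"}),   # initiation of the process
    2: ({"a"}, {"b", "f"}),   # termination of the process
    3: ({"d", "f"}, {"c"}),   # initiation of buffered I/O
    4: ({"c"}, {"d", "e"}),   # termination of buffered I/O
}

# Initial case: process idle, I/O idle, input buffer full.
INITIAL = frozenset({"b", "d", "e"})

def enabled(marking):
    """Transitions whose input places are all marked."""
    return [t for t in sorted(TRANSITIONS) if TRANSITIONS[t][0] <= marking]

def fire(marking, t):
    """Fire transition t and return the new marking."""
    inputs, outputs = TRANSITIONS[t]
    if not inputs <= marking:
        raise ValueError(f"transition {t} is not enabled")
    return frozenset((marking - inputs) | outputs)

def run(marking, steps):
    """Fire the lowest-numbered enabled transition `steps` times."""
    trace = []
    for _ in range(steps):
        ready = enabled(marking)
        if not ready:
            break
        marking = fire(marking, ready[0])
        trace.append((ready[0], marking))
    return trace
```

Under these assumed arcs exactly one transition is enabled at every step, the firing order is 1, 2, 3, 4 repeated, and places a and c are never marked together, which matches the observation that the process and the I/O are never concurrent.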


The  labels  correspond  to  those  used  in  the  model  of  single 
buffering,  with  primes  added  to  show  how  the  double  buffer 
model  is  composed  of  two  single  buffers.  Note  that  places 
b  and  b'  and  places  d  and  d'  guarantee  alternation 
of  transitions  1  and  1'  and  of  transitions  3  and  3'. 


This occurrence graph is based on an initial case in
which both the process and I/O are idle and both "input"
buffers are loaded. In this occurrence graph the process
and I/O are concurrent. Hence, if a (and a') and c
(and c') are the only places of significant duration,
the minimum time for a cycle in this system is
max (processing time, I/O time).
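The two cycle-time results can be placed side by side in a short sketch. It assumes, as the text does, that processing and I/O are the only activities of significant duration; the function names and the sample durations are illustrative.

```python
def single_buffer_cycle(processing_time, io_time):
    """Process and I/O strictly alternate, so their durations add."""
    return processing_time + io_time

def double_buffer_cycle(processing_time, io_time):
    """Process and I/O overlap, so the slower activity sets the pace."""
    return max(processing_time, io_time)

def speedup(processing_time, io_time):
    """Ratio of single-buffer to double-buffer minimum cycle time."""
    return (single_buffer_cycle(processing_time, io_time)
            / double_buffer_cycle(processing_time, io_time))

if __name__ == "__main__":
    print(single_buffer_cycle(3, 5), double_buffer_cycle(3, 5), speedup(3, 5))
```

The speedup is greatest, a factor of two, when processing time and I/O time are equal, and approaches one as either activity comes to dominate.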

The buffering model is generalizable to any number of
buffers. This becomes especially interesting when I/O
time is greater than processing time and it is possible
to perform several I/O operations concurrently. Consider
the case of n concurrently operable I/O devices. Then
minimum cycle time would be ≈ max (processing time,
(1/n) × I/O time). The dual also holds: if processing
time is greater than I/O time and an execution of the
process is not dependent on previous executions, the
availability of m concurrently operable processors
permits minimum cycle time to be ≈ max ((1/m) × processing
time, I/O time).

XXXII. Pipelined and Serial
Phased Systems

In  this  section  we  will  illustrate  the  distinction 
between  pipelined  phased  systems  and  conventional  serial 
phased  systems,  by  comparing  two  systems  which  perform 
the  same  task. 


Both systems perform as follows:

previous response is accepted;     event 1 (and event 1' for
next query is generated            pipelined system)

query is input                     event 2 initiates input;
                                   event 3 completes input

query is decoded                   event 4 initiates decoding;
                                   event 5 completes decoding

cross-indexing accomplished        event 6 initiates cross-indexing;
                                   event 7 completes cross-indexing

file access performed for hits     event 8 initiates file access;
                                   event 9 completes file access

response records output            event 10 initiates output;
                                   event 11 completes output

Serial Phased

Q    : query available
I    : input channel available
AI   : input in progress
Q'   : internal query available
DI   : dictionary available
ADI  : decoding in progress
Q"   : query decoded
X    : cross-index available
AX   : indexing in progress
A    : accession numbers of hits available
DF   : document file available
ADF  : documents being retrieved
R    : result available for input
O    : output channel available
AO   : output in progress
R*   : result outputted

SQ   : space for query, external
SQ'  : space for query, internal
SQ"  : space for query, decoded
SA   : space for 'hits'
SR   : space for documents retrieved
SR'  : external space for documents

... time + output time)

Pipeline Phased

... time, output time)

Notice that a cycle relates to throughput capacity;
in  both  systems  the  time  to  process  one  query  is  the 
same.  The  serial  system  requires  receiving  the  response 
to  the  first  query  before  submitting  the  next.  The 
pipeline  phased  system  can  process  a  number  of  queries 
concurrently.  The  processing  stages  are  staggered  in 
the  pipeline.  The  cycle  time  is  thus  a  measure  of  the 
maximum rate at which queries can be processed.

XXXIII. A Model of a Hardware Device
— The NCR CRAM Unit¹

In  this  section  we  illustrate  the  use  of  Petri  nets  in 
modelling  the  synchrony  and  concurrency  characteristics 
of  pseudo-random  access  mass  storage  devices.  The  NCR 
CRAM  Unit  is  a  pseudo-random  access  mass  memory  device. 
The  storage  medium  is  256  oxide-coated  cards.  Each  card 
has  a  set  of  notches  at  one  end,  which  permits  the 
selection,  at  random,  of  any  one  of  the  cards  from  the 
CRAM  magazine.  When  loaded  into  a  CRAM  unit,  the  cards 
hang  from  eight  rods  which  may  be  turned  in  such  a  way 
as to release exactly one card. When the card is released,
it falls freely until it reaches a rotating drum
to  which  it  is  pulled  by  means  of  a  vacuum,  and  the  card 
is  accelerated  to  the  surface  speed  of  the  drum.  Shortly 
after  attaining  this  speed,  the  leading  edge  of  the  card 
reaches  the  read-write  heads.  After  reading  or  writing, 
the  card  may  remain  on  the  drum,  to  be  recirculated  past 
the  heads  on  the  next  revolution,  or  it  may  be  released 
and  returned  to  the  magazine. 

¹See National Cash Register Company publication
MD 315-101 10-62 for a description of the equipment.

Three photocells provide the prime source of control of
the mechanism. PE 1 is located between the return chute
and the magazine and controls the operation of the loader
mechanism for the magazine. PE 2 signals the arrival
of a card, either one which has just been dropped, or
one which is recirculated on the drum. Reading or
writing must be done before the leading edge of the
card arrives at PE 3. Here we focus on certain characteristics
of the CRAM unit: we have modelled the elements
in  CRAM  that  relate  to  the  seek  time,  and  have  not 
modelled  the  details  of  writing  or  reading,  nor  the 
reject  mechanism  which  automatically  rejects  a  card  from 
the  drum  after  750ms.  of  total  inactivity.  We  also  have 
not  modelled  those  phenomena  associated  with  individual 
card  identity  (for  instance,  if  you  happen  to  select  a 
card  that  was  just  ejected  from  the  drum,  there  may  be 
an  additional  time  delay  —  the  length  of  time  needed  for 
the card to return to the magazine).

1. Decision to select a card

2. Decision not to select a card. (When a card is on
the drum, reject is triggered by the next card selection.
Being passive, i.e., not issuing a select, means deciding
not to select a card for each drum revolution that occurs
with a card on the drum.)

3. Beginning of select

4. End select, begin card drop

5. End (gate-PE 2), begin (PE 2-PE 3)

6. End card drop, begin (PE 2-PE 3) (Events 5 and 6 represent
the two alternate routes by which a card passes
the read-write station: Event 5 represents recirculation;
Event 6 represents new arrival from magazine.)

7. End (PE 2-PE 3), begin (PE 3-gate) (recirculate)

8. End (PE 2-PE 3), begin (PE 3-gate) (eject)

9. End (PE 3-gate), begin (gate-PE 2) (recirculate)

10. End (PE 3-gate), begin (gate-PE 1) (eject)

11. End (gate-PE 1), begin (PE 1-magazine)

12. End (PE 1-magazine)


a  decision made to select, select not yet begun
b  no decision yet for this drum revolution
c  decision made to select, card has not yet fallen
d  decision made not to select
e  no card on drum
f  selection occurring
g  card dropping
h  leading edge between PE 2 and PE 3
i  leading edge between PE 3 and gate
j  card being recirculated
k  leading edge between PE 3 and gate
m  leading edge between gate and PE 1
n  card entering magazine
p  no card in return chute
XXXIV. A Highly Concurrent Net Model
of the Cross-Indexing Grid

In Section I we presented a model of cross-indexing
information. The model consisted in a grid: horizontal
lines represented items, and vertical lines represented
descriptors; a given intersection j,k was circled if
and only if descriptor j applied to item k. Note,
however, that this representation is static: it
represents the cross-indexing information (i.e., the
set of descriptor-item relations) used in performing a
query (or update), but it does not represent the actual
process of performing a query (or update). In this section,
then, we will develop a Petri net grid model of
cross-indexing which represents the process of query performance.
(It will become clear that the model can be expanded to
represent the performance of updates.) This model will
exhibit possibilities for concurrency which may be exploited
by batching, buffering, or pipelining.

In Section XXIX we introduced a net model of a bit channel.
Roughly speaking, the net model of the cross-indexing grid
is constructed by replacing each horizontal and each
vertical line in the grid with such a bit channel. We
may think of the vertical channels as transmitting upward
and of the horizontal channels as transmitting from right
to left. A query is made by selecting a subset of the
descriptors. In the net model, then, a query will be made
by supplying a value to each of the vertical channels as
follows: a "1" is supplied if the corresponding
descriptor is in the query set; a "0" is supplied if it is
not. As a value is transmitted up a vertical channel, a
"copy" of it is "deposited" at each intersection.
Furthermore, at each intersection j,k a value is already stored
as follows: a "1" if descriptor j applies to item k
(i.e., if the intersection is circled); a "0" if not.

Each of the horizontal channels generates "1's"
continuously from its right end and transmits them
leftward; as a "1" is transmitted leftward it may be
transformed into a "0" as a function of the state of one of the
intersections it encounters. If the value which reaches
the left end of a given horizontal channel is a "1", then
the corresponding item satisfies the query; if it is a
"0", the item does not satisfy the query. Let us call
the bit stored at a given intersection the "cross-indexing
bit", the bit received from the vertical channel the "query
bit", the bit received from the horizontal channel the
"incoming response bit", and the bit transmitted leftward
(to the next intersection) the "outgoing response bit".

We can then describe the logic at an intersection j,k
as follows: if the incoming response bit is a "0", then
the outgoing response bit will be a "0" (i.e., it has
already been determined that item k does not satisfy
the query); if the incoming response bit is a "1" and the
query bit is a "0", the outgoing response bit is a "1"
(i.e., descriptor j is not in the query set); if the
incoming response bit is a "1" and the query bit is a "1"
and the cross-indexing bit is a "0", then the outgoing
response bit is a "0" (i.e., j is in the query set and
it does not apply to k); if the incoming response bit,
the query bit, and the cross-indexing bit are all "1's",
then the outgoing response bit is a "1" (j is in the
query set and applies to k). An additional vertical
channel is provided at the left edge of the grid model
for "reading out" the results of a query.
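The intersection logic just described can be restated as a
small truth function. The sketch below is illustrative only:
it evaluates the grid sequentially rather than as a concurrent
net, and the names outgoing_response and evaluate_query, the
row-per-item layout of the grid, and the convention that the
descriptor index increases toward the right are assumptions of
the sketch, not part of the model.

```python
from typing import List


def outgoing_response(incoming: int, query_bit: int, cross_bit: int) -> int:
    """Outgoing response bit at one intersection j,k (hypothetical helper)."""
    if incoming == 0:
        return 0          # item k is already known not to satisfy the query
    if query_bit == 0:
        return 1          # descriptor j is not in the query set
    return cross_bit      # j is in the query set: 1 only if j applies to k


def evaluate_query(grid: List[List[int]], query: List[int]) -> List[int]:
    """Return one response bit per item.

    grid[k][j] is the cross-indexing bit for descriptor j and item k
    (assumed layout); query[j] is the query bit for descriptor j.
    """
    results = []
    for row in grid:
        response = 1                            # a "1" enters at the right end
        for j in reversed(range(len(query))):   # travel from right to left
            response = outgoing_response(response, query[j], row[j])
        results.append(response)
    return results


# Three items, three descriptors (made-up data for illustration).
example_grid = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]
print(evaluate_query(example_grid, [1, 1, 0]))
```

In this made-up grid only the first item carries both
descriptors 0 and 1, so only its response bit remains a "1".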

Because  the  various  elements  of  the  net  model  intersect 
each  other,  we  will  not  try  to  represent  an  entire  net 
grid pictorially. Instead we present the elements
individually below:
Figure XXXIV-1

Vertical Channel for Descriptor j. The value of
descriptor j for each query is "tapped off" at each
intersection.

(Figure label: query initiation)
Figure XXXIV-3
Intersection j,k

R j,k = 0 : it is already known that item k does not
            satisfy the query

R j,k = 1 : thus far item k satisfies the query

Q j = 0   : descriptor j is not in the query set

Q j = 1   : descriptor j is in the query set

j,k = 0   : descriptor j does not apply to item k

j,k = 1   : descriptor j applies to item k
Note that the elements of our net model of the
cross-indexing grid are pipelines, so that it is capable of
highly concurrent operation. The number of queries
which can be evaluated concurrently is equal to I+D,
where I = the number of items in the system and
D = the number of descriptors in the system. The processing
time for one query (i.e., the time between initiation of
a given query and completion, or response read-out, of
that query) will be equal to C(I+D), where C is a
constant. However, the throughput will be governed by C
alone, not by C(I+D). That is, if queries can be input at a
sufficient rate (i.e., approaching one query per interval C),
the time between successive outputs will approach C.
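The distinction between processing time and throughput can be
made concrete with a little arithmetic. The following is a
minimal sketch, assuming that one new query is admitted every
C time units and that the first query starts at time zero; the
function name completion_times and the sample figures are
hypothetical.

```python
from typing import List


def completion_times(num_queries: int, items: int, descriptors: int,
                     stage_time: int) -> List[int]:
    """Read-out time of each query in a fully loaded pipeline.

    Assumes query n is initiated at time n * stage_time, so every
    query finishes C(I+D) after its own initiation.
    """
    latency = stage_time * (items + descriptors)   # C(I+D) for one query
    return [n * stage_time + latency for n in range(num_queries)]


# Made-up figures: I = 4 items, D = 2 descriptors, C = 2 time units.
times = completion_times(3, items=4, descriptors=2, stage_time=2)
print(times)                                         # read-out times
print([b - a for a, b in zip(times, times[1:])])     # gaps between outputs
```

Each query still takes C(I+D) from initiation to read-out, but
successive read-outs are separated only by C.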
APPENDIX I
Petri Nets¹


Formally, a Petri net is a directed graph with two kinds
of nodes: places, represented as circles; and transitions,
represented as line segments. Each directed arc, represented
as an arrow, connects one place with one transition. An
arrow from a place to a transition means that the place is
an input to the transition; an arrow from a transition to
a place means that the place is an output of the transition.
Every place in a net is an output of at least one transition
and an input to at least one transition. No place may be
both an input to and an output of the same transition.

A place is capable of two states: full or empty. The
state of a net is given by a list of all its full places.

A transition may fire if and only if all of its inputs are
full. When a transition fires, all of its inputs are
emptied and all of its outputs are filled. If some place
is input to two or more transitions, all of whose inputs
are full, these transitions are in conflict. Only one of
the transitions (any one) may fire in such a situation.
(See Figures A, B, and C for examples of net diagrams.
Figure B shows a net with conflict.)


¹For a comprehensive account of Petri nets we
refer the reader to the "Final Report for the Information
System Theory Project", RADC Contract # AF 30(602)-4211,
by Dr. Anatol W. Holt et al.
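The firing rule and the notion of conflict stated above lend
themselves to a direct simulation. The sketch below is one
possible rendering, not the notation of the report: the small
net NET, its place and transition names, and the helper
functions enabled, fire, and in_conflict are all hypothetical.
The sketch does not check whether an output place is already
full before filling it.

```python
from typing import Dict, FrozenSet, List, Tuple

# A net maps each transition name to (input places, output places).
Net = Dict[str, Tuple[FrozenSet[str], FrozenSet[str]]]

# Hypothetical net: t1 and t2 share the input place B.
NET: Net = {
    "t1": (frozenset({"A", "B"}), frozenset({"D"})),
    "t2": (frozenset({"B", "C"}), frozenset({"E"})),
    "t3": (frozenset({"D"}), frozenset({"A", "B"})),
    "t4": (frozenset({"E"}), frozenset({"B", "C"})),
}


def enabled(net: Net, full: FrozenSet[str]) -> List[str]:
    """Transitions all of whose input places are full."""
    return sorted(t for t, (ins, _) in net.items() if ins <= full)


def fire(net: Net, full: FrozenSet[str], t: str) -> FrozenSet[str]:
    """Empty the inputs of t and fill its outputs."""
    ins, outs = net[t]
    if not ins <= full:
        raise ValueError(f"transition {t} may not fire")
    return (full - ins) | outs


def in_conflict(net: Net, full: FrozenSet[str]) -> List[Tuple[str, str]]:
    """Pairs of enabled transitions that share an input place."""
    ready = enabled(net, full)
    return [(a, b) for i, a in enumerate(ready) for b in ready[i + 1:]
            if net[a][0] & net[b][0]]


state = frozenset({"A", "B", "C"})
print(enabled(NET, state), in_conflict(NET, state))
state = fire(NET, state, "t1")      # resolve the conflict in favor of t1
print(sorted(state), enabled(NET, state))
```

With A, B, and C full, t1 and t2 are both able to fire but
compete for B; once t1 has fired, B is empty and t2 can no
longer fire.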


Figure A

A net and an occurrence-graph ...
The shaded places are full, ...
time slices of the o-graph.


Figure B

A net with conflict and the ...
basis. When A, B, and C are ...
or transition 2 fires, but ...
Figure C

1L : Ball 1 is moving
     counter-clockwise.

1R : Ball 1 is moving
     clockwise.

2L : Ball 2 is moving
     counter-clockwise.

etc.


In using Petri nets to describe a system, each place is
associated with a proposition about the system. By
interpretation, when a place is full, the proposition
associated with it is true. In other words, the condition
described by a proposition holds in the system when the
associated place is full. The state of a system described
by a given state of its net is the conjunction of the
propositions associated with the full places.* Thus a
net diagram together with a suitable initial assignment
of place states (corresponding to the conditions which
hold in the system initially) makes possible a formal
simulation of the behavior of the corresponding system.


*It is perhaps misleading to speak of "system states"
here since a net does not necessarily define a totally
ordered sequence of states. (Formally, this is because
some transitions may fire concurrently; that is, their
firings are not temporally ordered.) In this respect,
nets differ fundamentally from state machines.

Note that it is the occupancy of places which is viewed
as having duration. Transitions merely bound places;
the firing of a transition is not viewed as time-consuming.
Rather, it is a separation of distinct place occupancies.
Hence, the propositions associated with places describe
conditions involving time-consuming operations or states.
Figure C, for example, is a net representation of four
balls moving and colliding on a single-lane circular track.
The propositions describing the system are all of the
form: "ball n is moving clockwise (or counter-clockwise)".

We may view an occurrence-graph, or o-graph, as a directed
graph which represents a simulation history of some net.
Formally, an o-graph consists of vertices, arcs, and
labels associated with the arcs. Each label corresponds
to some condition of the system being represented. (The
words label and condition are therefore used interchangeably
in this context.) Each arc represents an interval of
place occupancy (or condition holding); the place (and
hence the condition) is designated by the label associated
with the arc. An inner vertex represents a transition
firing and hence an occurrence in the system being represented.
(The terms inner vertex and occurrence are accordingly
used interchangeably.) Thus an occurrence may be described
as follows: the conditions of the input arcs cease to
hold (the input places become empty); the conditions of
the output arcs begin to hold (the output places become
full). (See Figures A, B, and D for examples of o-graphs.)

Two  occurrences  are  said  to  be  temporally  ordered  if  and 
only  if  there  is  a  path  from  one  to  the  other;  the  former 
precedes  the  latter.  Note  that  some  occurrence  pairs  in 
an  o-graph  are  temporally  ordered  while  others  are  not. 
Occurrences which are not ordered are said to be
concurrent. Similarly, two arcs are temporally ordered if
and only if there is a path from one to the other; arcs
which are not temporally ordered are concurrent. A
time-slice is a maximal set of pairwise concurrent arcs.
A time-slice represents a possible state of the net (and
hence of the system) during the history which the o-graph
describes. (See Figure A.)
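The definitions of temporal ordering, concurrency, and
time-slice can be checked mechanically on a small o-graph.
The following is a brute-force sketch suitable only for very
small graphs; the example o-graph ARCS, its vertex names, and
the helper functions are hypothetical. An arc is written as
(label, tail vertex, head vertex), and one arc is taken to
precede another when the tail of the second can be reached
from the head of the first.

```python
from itertools import combinations
from typing import List, Set, Tuple

Arc = Tuple[str, str, str]   # (label, tail vertex, head vertex)

# Hypothetical o-graph: occurrence v1 ends A and B and begins C and D;
# occurrence v2 ends C and begins E.
ARCS: List[Arc] = [
    ("A", "s1", "v1"), ("B", "s2", "v1"),
    ("C", "v1", "v2"), ("D", "v1", "e1"),
    ("E", "v2", "e2"),
]


def reachable(arcs: List[Arc], start: str) -> Set[str]:
    """All vertices reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for _, tail, head in arcs:
            if tail == v and head not in seen:
                seen.add(head)
                stack.append(head)
    return seen


def precedes(arcs: List[Arc], a: Arc, b: Arc) -> bool:
    """True if there is a path from arc a to arc b."""
    return b[1] in reachable(arcs, a[2])


def concurrent(arcs: List[Arc], a: Arc, b: Arc) -> bool:
    """Two distinct arcs are concurrent if neither precedes the other."""
    return not precedes(arcs, a, b) and not precedes(arcs, b, a)


def time_slices(arcs: List[Arc]) -> List[List[str]]:
    """Labels of every maximal set of pairwise concurrent arcs."""
    found: List[frozenset] = []
    for size in range(len(arcs), 0, -1):          # largest sets first
        for idx in combinations(range(len(arcs)), size):
            if all(concurrent(arcs, arcs[i], arcs[j])
                   for i, j in combinations(idx, 2)):
                if not any(frozenset(idx) < s for s in found):
                    found.append(frozenset(idx))
    return [sorted(arcs[i][0] for i in s) for s in found]


print(time_slices(ARCS))
```

In this example the time-slices are {A, B}, {C, D}, and {D, E}:
condition D may hold alongside either C or E, since nothing
orders D with respect to the occurrence v2.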


Figure D

(two balls moving clockwise and two counter-clockwise)

(three balls moving counter-clockwise and one clockwise)
An o-graph may be decomposed at a time-slice. Two o-graphs
may be composed if the terminal conditions of one are
identical to the initial conditions of the other. An
o-graph whose initial and terminal conditions are
identical is termed an o-cycle. An o-graph formed by
composing some number of copies of an o-cycle is termed
a repetition stretch of the o-cycle. An o-cycle which
cannot be decomposed into further o-cycles is termed an
irreducible o-cycle. (The o-graphs shown in Figures A,
B, and D are all irreducible o-cycles.) For every net
together with a suitable assignment of place states,
there is at least one basis, consisting of a finite set
of irreducible o-cycles from which every possible
simulation history may be generated by composition and
decomposition. If the net contains no conflict, its basis
consists of one irreducible o-cycle. Note that a given
net diagram may be capable of several different disjoint
behaviors given different initial place assignments.
Figure D, for example, shows the bases for the three
different behaviors of which the net in Figure C is
capable.
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